Abstract. This paper addresses two central problems for probabilistic
processing models: parameter estimation from incomplete data and efficient
retrieval of most probable analyses. These questions have been answered
satisfactorily only for probabilistic regular and context-free models. We
address these problems for a more expressive probabilistic constraint logic
programming model.

We present a log-linear probability model for probabilistic constraint logic
programming. On top of this model we define an algorithm to estimate the
parameters and to select the properties of log-linear models from incomplete
data. This algorithm is an extension of the improved iterative scaling
algorithm of Della Pietra, Della Pietra, and Lafferty (1995). Our algorithm
applies to log-linear models in general and is accompanied by suitable
approximation methods when applied to large data spaces. Furthermore, we
present an approach for searching for most probable analyses of the
probabilistic constraint logic programming model. This method can be applied
to the ambiguity resolution problem in natural language processing
applications.



1. Introduction 

Rabiner (1989) identified three basic problems of interest that must 
be solved for a Hidden Markov Model to be useful in real-world speech 
recognition applications: the parameter estimation problem, the optimal 
state sequence problem and the observation sequence probability problem. 
These problems generalize to arbitrary probabilistic symbol processing 
models in various real-world applications in an obvious manner. The first
two problems can be stated in a more general way as follows. 

1. Let an unanalyzed observation sequence O = O_1, ..., O_n and a
probabilistic processing model with parameter set A be given, and suppose
that the value of A is unknown and that O forms a random sample from the
distribution involving A. How can the model parameters A be estimated?

2. Given O_i and A, how can the most probable analysis of the input O_i be
found efficiently?

Recent interest in probabilistic models of natural language processing 
can be attributed to the fact that solutions to the above-mentioned general 
problems can lead quite directly to effective, but conceptually simple and 
mathematically clear solutions to various problems in natural language 
processing. 

This connection can be illustrated with the problem of ambiguity reso- 
lution (or disambiguation or parse ranking) as follows: Grammars describ- 
ing a nontrivial fragment of natural language may attach a large number 
of different analyses to sentences of reasonable length. Since not all of 
these analyses are in accord with human perceptions, there is clearly a 
need to distinguish more plausible analyses of an input from less plau- 
sible or totally spurious ones. The simple but effective idea adopted in 
probabilistic grammars is to connect the plausibility of an analysis with 
its probability. In this vein the correct, i.e., most plausible analysis of a 
string is assumed to be the most probable analysis of the string. A solution 
to problem 1 will adapt the model parameters A to the input corpus O
and thus justify the assumption that the correct parse of a string O_i is the
most probable parse of O_i as produced by the grammar parametrized by
A. A solution to problem 2 will yield an algorithm to search for the most
probable parse of a given input string O_i as produced by a probabilistic
grammar with parameter set A.



Most popular approaches to solving these problems in the area of natural
language processing are based on Baum's maximization technique, which is
known as the "Baum-Welch algorithm" (Baum and Eagon 1967; Baum, Petrie,
Soules, and Weiss 1970; Baum 1972). This algorithm estimates the parameters
of a Hidden Markov Model, i.e., a stochastic regular grammar, in a framework
of maximum likelihood estimation from incomplete data. That is, the
parameters are iteratively reestimated until they converge to a set of values
which locally maximize the likelihood function, i.e., the probability that
the model assigns to the given unanalyzed observation sequence. In this sense
the model parameters are adjusted to best describe a given observation
sequence. The estimation algorithm can be defined inductively forwards and
backwards, yielding the efficient "forward-backward algorithm"^1. Baker (1979)
generalized this algorithm to the so-called "inside-outside algorithm"^2,
which efficiently estimates the parameters of stochastic context-free
grammars. Both algorithms are special instances of the EM algorithm for
maximum likelihood estimation from incomplete data (Dempster, Laird, and
Rubin 1977). A dynamic-programming approach similar to the one used in the
efficient versions of the parameter estimation algorithms can be used to find
the most probable analysis of stochastic context-free and stochastic regular
grammars and is known as the "Viterbi algorithm" (Viterbi 1967).
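
The dynamic-programming idea behind the search for a most probable analysis
can be illustrated for the regular (Hidden Markov Model) case. The following
is only a minimal sketch in Python; the function name `viterbi`, the
dictionary-based model layout, and the toy two-state model are illustrative
assumptions and do not come from the works cited above. It assumes all
probabilities are strictly positive, since it works with logarithms.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, path) of the most probable state sequence."""
    # best[t][s] = (log-prob of the best path ending in s at time t, backpointer)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Best predecessor r for state s at this time step.
            score, prev = max(
                (best[-1][r][0] + math.log(trans_p[r][s]), r) for r in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Follow the backpointers from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    log_prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return log_prob, path

# Hypothetical toy model, chosen only to exercise the function.
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

log_prob, path = viterbi(("normal", "cold", "dizzy"),
                         states, start_p, trans_p, emit_p)
```

Each column of the table is computed from the previous one only, so the cost
is linear in the length of the observation sequence instead of exponential in
the number of candidate state sequences.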

The class of algorithms based upon Baum's maximization technique includes
not only regular and context-free versions but has recently been extended to,
e.g., stochastic context-free grammars with bracketing constraints (Pereira
and Schabes 1992) and feature-based constraints (Briscoe and Waegner 1992),
stochastic dependency grammars (Carroll and Charniak 1992) and stochastic
lexicalised tree-adjoining grammars (Resnik 1992; Schabes 1992). Despite the
generality of the algorithm, there are clear restrictions on the expressivity
of the probabilistic processing models the algorithm can be applied to. Even
if the structural operations of the probabilistic processing model may be
sensitive to contextual features, this context-sensitivity has to be internal
to the structural elements combined. The combination process itself has to be
context-free, i.e., in terms of probability theory, different stochastic
derivation choices at the same time-step have to be independent of the
history of the derivation process and also independent of one another.

This fact poses a problem for attempts to build stochastic versions of 
grammars which are more expressive than context-free. The grammars we 
are interested in here are constraint logic grammars (CLGs), i.e., highly 
expressive constraint-based grammars formalized in a (Turing-)powerful 
framework of constraint logic programming (CLP)^3. This treatment of



^1 See Rabiner (1989) for a tutorial.

^2 See Lari and Young (1990) and Jelinek, Lafferty, and Mercer (1990) for
introductions.

^3 CLP provides one possible approach to an operational treatment of various
purely declarative grammar frameworks by an embedding of arbitrary logical
languages into constraint logic programs. CLGs thus are simply understood as
grammars formulated by means of a suitable logical language which can be
embedded as a constraint language into a CLP scheme. Examples of an embedding
of feature-based logical languages into the CLP scheme of Höhfeld and Smolka
(1988) are the approaches of Dörre and Dorna (1993) and Götz (1995).

CLGs as special applications of CLP will allow us to refer in the following
to the general framework of CLP. Stochastic versions of CLP exhibit a
context-sensitivity problem in that incompatible variable bindings can lead
to failure derivations depending on the (simultaneous) history of the
derivation process. Eisele (1994), Miyata (1996) and Osborne and Briscoe
(1997), who attempt to adapt Baum's maximization technique to estimate the
parameters of their stochastic constraint-based models, try to escape from
this problem by redefining the derivation process of their respective
probabilistic processing model to include only successful derivations and by
renormalizing the probability distribution over derivations. Unfortunately,
this move contradicts the basic independence assumption made in the parameter
estimation algorithm and prohibits an application of Baum's technique as an
optimization algorithm in maximum likelihood estimation of stochastic CLP.

To our knowledge, there is to date no approach which solves problems 1
and 2 satisfactorily for a probabilistic model of CLP. However, an excellent
starting point is the approach to "stochastic attribute-value grammars" of
Abney (1996). Abney presents a probabilistic model of grammars which
produce analyses in the form of dags (directed acyclic graphs) by defining
the probability distribution over these dags as a random field. For such
probability models algorithms to estimate parameters from complete data
exist (Della Pietra, Della Pietra, and Lafferty 1995) and are shown to
be applicable to the stochastic grammar model. However, complete data 
means large corpora of costly manually analyzed, i.e., hand-parsed, data. 
So one open question is how to estimate parameters from incomplete, 
unanalyzed input, i.e., from simple corpora of natural language strings. 
Furthermore, if the intended application is ambiguity resolution, a second 
question is how to use the structure of the probabilistic model to guide 
the search for the most probable analysis of a string rather than simply 
listing all possible analyses and choosing the best one. 

The aim of this paper is to present a probabilistic model of CLP and
to couple this with an algorithm to induce the parameters and properties
of such models from incomplete data and with an algorithm to search
for best analyses. Our approach to probabilistic CLP is based on a
log-linear probability model, i.e., a powerful exponential probability model
well-known in probabilistic network modeling (Geman and Geman 1984;
Ackley and Hinton 1985; Pearl 1988). On top of this probability model we
define an algorithm to estimate the parameters and to select the properties
of log-linear models from incomplete data. This induction algorithm is an
extension of the improved iterative scaling algorithm of Della Pietra, Della
Pietra, and Lafferty (1995) adjusted to incomplete data. The techniques
developed in this context apply to log-linear models in general and are
accompanied by suitable Monte Carlo approximations when applied to
large data spaces. For the intended CLP application we build upon the
CLP scheme of Höhfeld and Smolka (1988). In this context we present
an algorithm to search for most probable analyses of the probabilistic
CLP model. This algorithm is formulated as a probabilistic version of
Earley deduction and can be applied to the ambiguity resolution problem
in natural language processing applications^4.

The remainder of this paper is organized as follows. Section 2 introduces
the basic formal concepts of CLP. Section 3 discusses in more detail the
above-mentioned context-sensitivity problem in the case of parameter
estimation from incomplete data. Section 4 presents a general log-linear
model for probabilistic CLP. Problem 1, i.e., parameter estimation of
log-linear probability models from incomplete data, is treated in Sect. 5. An
algorithm for automatically selecting properties of log-linear models in
the presence of incomplete data is also presented. Section 6 discusses the
problem of estimating the terms in the formula presented in Sect. 5 in
the presence of large sample spaces by Monte Carlo methods. Problem 2,
i.e., methods to search for most probable analyses for probabilistic CLP,
is approached in Sect. 7 in the form of a probabilistic version of Earley
deduction. Section 8 gives some concluding remarks and discusses the
relation of probabilistic CLP to other probabilistic processing models.



2. Constraint Logic Programming 

In the following we briefly review the basic definitions of the CLP
scheme of Höhfeld and Smolka (1988). This scheme is a powerful extension
of conventional logic programming (see Lloyd (1987)) and also of the
CLP scheme of Jaffar and Lassez (1986) by an incorporation of arbitrary
constraint languages and corresponding constraint solving methods into
logic programming languages.



^4 Even if a solution to our two problems can be seen as a necessary
prerequisite for further applications such as grammar induction or language
modelling, it is a necessary and sufficient prerequisite only for the
application of ambiguity resolution. For the application of grammar
induction, the question of how to impose useful constraints on the form of
possible analyses in order to reduce the number of parameters to be estimated
will become important. In language modelling applications, a shift of focus
from imposing a probability distribution over a given set of analyses to
imposing a probability distribution over input strings is made.
For example, Prolog is obtained by employing equations between first-order
terms as constraint language and by interpreting these equations in
the Herbrand universe. The corresponding operational semantics of
SLD-resolution can be seen to rely on a constraint solver which solves term
equations in the Herbrand universe by term unification.

A constraint logic program $\mathcal{P}$ is then defined with respect to an
implicit basic constraint language $\mathcal{L}$ and its relational extension
$\mathcal{R}(\mathcal{L})$ as follows (see Höhfeld and Smolka (1988)).

Definition 1 (definite clause specification). A definite clause
specification $\mathcal{P}$ in $\mathcal{R}(\mathcal{L})$ is a set of
definite clauses of the form

$$A \leftarrow B_1 \,\&\, \ldots \,\&\, B_n \,\&\, \phi$$

where $A, B_1, \ldots, B_n$ are $\mathcal{R}(\mathcal{L})$-atoms,
$r(\vec{x})$ is an $\mathcal{R}(\mathcal{L})$-atom iff $r \in \mathcal{R}$
is a relational symbol with arity $n$ and $\vec{x}$ is an $n$-tuple of
pairwise distinct variables, and $\phi$ is an $\mathcal{L}$-constraint
ranging over the variables mentioned.

Constraint languages have to be closed under variable renaming, closed 
under intersection, and the satisfiability problem of such languages has to 
be decidable. 

A goal is defined as a possibly empty conjunction of
$\mathcal{L}$-constraints and $\mathcal{R}(\mathcal{L})$-atoms. Relying on
conventional logical terminology, a $\mathcal{P}$-answer of a goal $G$ for a
program $\mathcal{P}$ can be defined as a satisfiable $\mathcal{L}$-constraint
$\phi$ s.t. the implication $\phi \rightarrow G$ is a logical consequence of
$\mathcal{P}$.

SLD-resolution is generalized by performing goal reduction only on the
$\mathcal{R}(\mathcal{L})$-atoms and solving conjunctions of collected
$\mathcal{L}$-constraints by the $\mathcal{L}$-constraint solver. Goal
reduction is managed by a binary relation $\xrightarrow{r}$ on the set of
goals as follows ($V$ denotes the finite set of variables in the query and
$\mathcal{V}(\cdot)$ is a function assigning to a constraint the finite set
of variables constrained by it).

$$A \,\&\, G \xrightarrow{r} F \,\&\, G \quad \text{if } A \leftarrow F \text{ is a variant of a clause in } \mathcal{P} \text{ s.t. } (V \cup \mathcal{V}(G)) \cap \mathcal{V}(F) \subseteq \mathcal{V}(A).$$

A second rule takes care of constraint solving for the
$\mathcal{L}$-constraints appearing in subsequent goals. The rule takes the
conjunction of the $\mathcal{L}$-constraints from the reduced goal and the
applied clause and gives, via the black box of a suitable
$\mathcal{L}$-constraint solver, a satisfiable $\mathcal{L}$-constraint in
solved form if the conjunction of $\mathcal{L}$-constraints is satisfiable.
The constraint solving rule can then be defined as a total function
$\xrightarrow{c}$ on the set of goals as follows ($CS(\cdot)$ denotes the
$\mathcal{L}$-constraint solver as a function on the set of
$\mathcal{L}$-constraints).

$$\phi \,\&\, \phi' \,\&\, G \xrightarrow{c} \phi'' \,\&\, G \quad \text{if } CS(\phi \,\&\, \phi') = CS(\phi'').$$
Höhfeld and Smolka (1988) show that this generalized SLD-resolution
method is a sound and complete method for inferring $\mathcal{P}$-answers.
For the following discussion it will be convenient to view this operational
semantics as a search of a tree. For a given query and a given program,
the search space determined by the derivation rules $\xrightarrow{r}$ and
$\xrightarrow{c}$ can be described as a derivation tree as follows.

Definition 2 (derivation tree). A derivation tree determined by a query
$G_1$ and a definite clause specification $\mathcal{P}$ has to satisfy the
following conditions:

1. Each node is either a relation-node or a constraint-node.

2. The descendants of every relation-node are all constraint-nodes s.t.
for every $\xrightarrow{r}$-resolvent $G'$ obtainable by a clause $C$ from
goal $G$ in a relation-node, there is a descending constraint-node labeled
by $C$ and $G'$.

3. The descendants of every constraint-node are all relation-nodes s.t.
for every unique $\xrightarrow{c}$-resolvent $G \,\&\, \phi''$ obtainable
from goal $G \,\&\, \phi \,\&\, \phi'$ in a constraint-node, there is a
descending relation-node labeled by $G \,\&\, \phi''$.

4. The root node is a relation-node labeled by $G_1$.

5. A success node is a terminal relation-node labeled by a satisfiable
$\mathcal{L}$-constraint.

Successful derivations correspond to certain subtrees of derivation trees 
and can be defined as proof trees as follows. 

Definition 3 (proof tree). A proof tree for a query $G_1$ from $\mathcal{P}$
is a subtree of a derivation tree determined by $G_1$ and $\mathcal{P}$ and
is defined as follows:

1. A relation-node of the proof tree is a relation-node of the supertree 
and takes one of the descendants of the supertree relation-node as 
its descendant. 

2. A constraint-node of the proof tree is a constraint-node of the su- 
pertree and takes the unique descendant of the supertree constraint- 
node as its descendant. 

3. The root node of the proof tree is the root node of the supertree. 

4. The terminal node of the proof tree is a success node of the supertree 
labeled by a satisfiable $\mathcal{L}$-constraint, called answer constraint.

3. Baum's Maximization Technique and Probabilistic CLP 

One straightforward way to add statistical information to symbol
processing models is to define the derivation process of such models as a
stochastic process as follows: Make a stochastic choice at each derivation
step and assume the stochastic choices to be independent of each other.
Calculate the probability of a derivation as the joint probability of the
independent stochastic choices made and the probability of an input as
the sum of the probabilities of its derivations. This is the probabilistic
model underlying, e.g., Hidden Markov Models or stochastic context-free
grammars. The parameters of such models, i.e., the probabilities of the
stochastic choices, can be estimated by Baum's maximization technique
(Baum and Eagon 1967; Baum, Petrie, Soules, and Weiss 1970; Baum 1972).



The basic formal concepts of this technique can be described in an
abstract way as follows: Let $\Pi = \{\pi_{ij}\}$ be the parameter set of an
abstract probabilistic symbol processing model where $\pi_{ij} \geq 0$ and
$\sum_j \pi_{ij} = 1$. The variable $i$ ranges over the types of choices that
the stochastic process makes and the variable $j$ ranges over the
alternatives to choose from when a choice of type $i$ is made. Furthermore,
let $y$ denote an input of the probabilistic processing model, i.e., an
observation sequence, and let $x$ denote an output of the model, i.e., an
analysis, and let $Y(x) = y$ be the unique observation corresponding to
analysis $x$ and $X(y) = \{x \mid Y(x) = y\}$ be the set of analyses of
observation $y$. Finally, let $\nu_{ij}(x)$ be the number of selections of
alternative $j$ for a choice of type $i$ in analysis $x$. Then the
probability of an analysis can be calculated as the product of the
probabilities of the stochastic choices made in producing it:

$$p(x; \pi) = \prod_{i,j} \pi_{ij}^{\nu_{ij}(x)}$$

The probability of an observation then is the sum of the probabilities of
its analyses:

$$p(y; \pi) = \sum_{x \in X(y)} p(x; \pi)$$

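The purpose of Baum's maximization technique is to find maximum
likelihood parameter values, i.e., values $\{\pi_{ij}\}$ which maximize the
likelihood function $P(\pi) = \prod_y p(y; \pi)$ for a given $y$-sample. To
this end Baum defines a transformation $\tau$ of $\pi$ into itself, which
looks in its basic form as follows:

$$\tau(\pi)_{ij} = \frac{\pi_{ij}\, \partial P / \partial \pi_{ij}}{\sum_{j'} \pi_{ij'}\, \partial P / \partial \pi_{ij'}}$$

and $\tau$ yields an iterative algorithm where each step is defined by

$$\pi_{ij}^{(t+1)} = \frac{\sum_y \sum_{x \in X(y)} p(x|y; \pi^{(t)})\, \nu_{ij}(x)}{\sum_{j'} \sum_y \sum_{x \in X(y)} p(x|y; \pi^{(t)})\, \nu_{ij'}(x)}$$

This algorithm is hill-climbing, i.e., it can be shown that
$P(\tau(\pi)) > P(\pi)$ unless $\tau(\pi)$ is a critical point of $P$ or
equivalently is a fixed point of $\tau$.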
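
One reestimation step of the abstract model can be sketched as follows. This
is a hedged illustration in Python, not the formulation of the cited works:
the names `analysis_prob`, `likelihood` and `reestimate`, the representation
of an analysis as a dictionary of counts keyed by (choice type, alternative),
and the two-coin toy model are all assumptions made for the example. The
sketch also assumes that every choice type receives a nonzero expected count.

```python
from collections import defaultdict
from math import prod

def analysis_prob(counts, pi):
    """p(x; pi): product of pi[i, j] ** nu_ij(x) over the choices made in x."""
    return prod(pi[key] ** n for key, n in counts.items())

def likelihood(corpus, pi):
    """P(pi): product over observations y of the summed probability of X(y)."""
    return prod(sum(analysis_prob(c, pi) for c in analyses) for analyses in corpus)

def reestimate(corpus, pi):
    """One step: expected counts N_ij, then normalization per choice type i."""
    expected = defaultdict(float)               # N_ij
    for analyses in corpus:                     # analyses = X(y) for one y
        probs = [analysis_prob(c, pi) for c in analyses]
        p_y = sum(probs)                        # p(y; pi)
        for counts, p_x in zip(analyses, probs):
            for key, n in counts.items():
                expected[key] += p_x / p_y * n  # p(x|y) * nu_ij(x)
    totals = defaultdict(float)                 # sum over j of N_ij
    for (i, _), value in expected.items():
        totals[i] += value
    return {(i, j): expected[(i, j)] / totals[i] for (i, j) in pi}

# Hypothetical toy model: pick coin A or B (choice type "coin"), then flip it.
def analysis(coin, side):
    return {("coin", coin): 1, (coin, side): 1}

X = {side: [analysis("A", side), analysis("B", side)] for side in ("H", "T")}
corpus = [X["H"], X["H"], X["T"]]               # the coin choice is unobserved
pi = {("coin", "A"): 0.5, ("coin", "B"): 0.5,
      ("A", "H"): 0.6, ("A", "T"): 0.4,
      ("B", "H"): 0.3, ("B", "T"): 0.7}
new_pi = reestimate(corpus, pi)
```

In this toy model the choices really are independent, so the step cannot
decrease the likelihood; the remainder of this section concerns what happens
when that independence fails.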

Attempts to apply this algorithm and the underlying abstract model
directly to a model of probabilistic CLGs or CLP were presented, e.g., by
Eisele (1994), Miyata (1996) and Osborne and Briscoe (1997). A detailed
critique of such attempts with respect to the problem of parameter
estimation from complete data can be found in Abney (1996). What Abney calls
the Expected Rule Frequency (ERF) parameter estimation method can be
seen as a special case of Baum's maximization technique. In order to show
that Baum's general algorithm fails as an optimization technique for the
maximum likelihood problem for probabilistic CLP, we can simply give
a counterexample using a deterministic program. In this case, parameter
estimation from incomplete data using Baum's method is equivalent to
using the ERF method. This point shall be made explicit in the following.

Let us apply the above-defined abstract model to a simple deterministic
constraint logic program. The stochastic choices of the abstract model
correspond to application probabilities of definite clauses in the
generalized SLD-resolution procedure; the alternatives to choose from when an
atom is selected in goal reduction are the different clauses defining the
selected atom. In the following example (see Fig. 1), each clause will be
annotated by a choice-alternative pair indicating a probabilistic parameter
$\pi_{ij}$.



    11   s(Z) <- p(Z) & q(Z).
    21   p(Z) <- Z = a.
    22   p(Z) <- Z = b.
    31   q(Z) <- Z = a.
    32   q(Z) <- Z = b.

Figure 1. A sample program



The relational atom s(Z) is defined uniquely in clause 11. The atoms
p(Z) and q(Z) each are defined in two different ways, which for the sake
of the example are considered to be incompatible. For a selection of atom
p(Z) one can choose between clauses 21 and 22 in a goal reduction step,
whereas for a choice of atom q(Z) the alternatives to choose from are
clauses 31 and 32. This program is deterministic for the queries
s(Z) & Z = a and s(Z) & Z = b. That is, there is only one proof tree from the
above program for each query (see Fig. 2). The proof tree $x_1$ for the query
s(Z) & Z = a uses clauses 11, 21 and 31 and yields answer constraint
Z = a; the proof tree $x_2$ for the second query uses clauses 11, 22 and 32
and gives answer constraint Z = b.
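
The determinism claim can be checked with a small interpreter. The following
Python sketch is deliberately specialized to this sample program and is not
the general CLP scheme: the clause table `PROGRAM`, the toy solver `solve`
(which only handles equations of the single variable Z with a constant) and
the generator `proof_trees` are names and simplifications assumed for the
illustration.

```python
# Clauses of the sample program: label -> (head, body atoms, constant for Z or None)
PROGRAM = {
    "11": ("s", ["p", "q"], None),
    "21": ("p", [], "a"),
    "22": ("p", [], "b"),
    "31": ("q", [], "a"),
    "32": ("q", [], "b"),
}

def solve(c1, c2):
    """Toy constraint solver for equations Z = const. Returns the solved
    conjunction (None means 'no constraint yet'), or False if unsatisfiable."""
    if c1 is None:
        return c2
    if c2 is None or c1 == c2:
        return c1
    return False

def proof_trees(atoms, constraint, used=()):
    """Yield (clause labels used, answer constraint) for each successful derivation."""
    if not atoms:
        yield list(used), constraint
        return
    first, rest = atoms[0], atoms[1:]
    for label, (head, body, const) in PROGRAM.items():
        if head != first:
            continue
        solved = solve(constraint, const)   # constraint solving step
        if solved is False:
            continue                        # failure derivation, abandoned
        # goal reduction step: replace the selected atom by the clause body
        yield from proof_trees(body + rest, solved, used + (label,))

trees_a = list(proof_trees(["s"], "a"))     # query s(Z) & Z = a
trees_b = list(proof_trees(["s"], "b"))     # query s(Z) & Z = b
trees_open = list(proof_trees(["s"], None)) # query s(Z) without a constraint
```

The clause combinations 21 with 32 and 22 with 31 are rejected by the solver;
these are the failure derivations that later make renormalization necessary.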

Let us now consider the application of Baum's maximization technique
to estimate the parameters of such a probabilistic CLP model (see Fig. 3).
An input corpus consisting of the three queries $y_1$ : s(Z) & Z = a,
$y_2$ : s(Z) & Z = a and $y_3$ : s(Z) & Z = b will yield the corresponding
unique proof trees $x_1 \in X(y_1)$, $x_1 \in X(y_2)$ and $x_2 \in X(y_3)$.
The conditional probabilities $p(x|y)$ for $x \in X(y)$ will be 1 in each
case since there is a unique proof tree for each query. Thus for the
calculation of $N_{ij} = \sum_y \sum_{x \in X(y)} p(x|y)\, \nu_{ij}(x)$, the
expected number of occurrences of clauses in proof trees, we simply have to
count and can ignore the respective probabilities of the proof trees. The
algorithm then will give unique estimated parameter values
$\pi_{ij} = N_{ij} / \sum_j N_{ij}$ immediately.

    x_1:  s(Z) & Z = a                   x_2:  s(Z) & Z = b
            | r, c                               | r, c
          11, p(Z) & q(Z) & Z = a              11, p(Z) & q(Z) & Z = b
            | r, c                               | r, c
          21, q(Z) & Z = a                     22, q(Z) & Z = b
            | r, c                               | r, c
          31, Z = a                            32, Z = b

Figure 2. Proof trees

                                      from sample program
    y      x ∈ X(y)   p(x|y)     N_11     N_21   N_22     N_31   N_32
    y_1    x_1        1          1·1      1·1    1·0      1·1    1·0
    y_2    x_1        1          1·1      1·1    1·0      1·1    1·0
    y_3    x_2        1          1·1      1·0    1·1      1·0    1·1
    Σ_y Σ_x                      3        2      1        2      1
    Σ_j N_ij                     3        3      3        3      3
    π_ij                         1        2/3    1/3      2/3    1/3

Figure 3. A sample estimation

If we now consider the calculation of the probability distribution over
the proof trees of such a probabilistic CLP model, we see that in contrast
to the above-defined abstract model we cannot simply calculate a product
for each proof tree. Instead, in order to get a proper probability
distribution over proof trees, we have to do an additional normalization. For
example, if the sum of the unnormalized probabilities of the proof trees
under the estimated model, $p(x_1; \pi) + p(x_2; \pi) = 4/9 + 1/9 = 5/9$, is
used as a normalization constant, then we will get a normalized probability
distribution over proof trees, $p'(x_1; \pi') = 4/5$, $p'(x_2; \pi') = 1/5$,
yielding a normalized likelihood of our training corpus
$P'(\pi') = (4/5)^2 \cdot 1/5 = .128$.
Note that the normalized probability distribution no longer refers to
specific parameter values. In fact, there is no analytical solution to the
problem of finding parameter values $\pi'$ for the program of Fig. 1 which
yield probability distribution $p'$ over the proof trees of Fig. 2. However,
given the same preconditions, we can find a probability distribution
$p''(x_1; \pi'') = 2/3$, $p''(x_2; \pi'') = 1/3$ which yields a higher
likelihood $P''(\pi'') = (2/3)^2 \cdot 1/3 = .148$. This contradicts the
assumption that the parameter values estimated by Baum's technique are the
requested maximum likelihood estimates for a probabilistic CLP model as
defined above.
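
The arithmetic of this counterexample can be replayed with exact fractions.
The sketch below is only an illustration; the variable names (`TREES`,
`CORPUS`, `p_norm`, `p_alt`) and the encoding of a proof tree as its list of
clause labels are assumptions of the example, with the first digit of a label
taken as the choice type.

```python
from fractions import Fraction
from math import prod

# Proof trees of the sample program, given by the clauses they use.
TREES = {"x1": ["11", "21", "31"], "x2": ["11", "22", "32"]}
CORPUS = ["x1", "x1", "x2"]   # unique proof trees of the queries y1, y2, y3

# Relative clause frequencies per choice type (first digit of the label).
counts = {}
for tree in CORPUS:
    for label in TREES[tree]:
        counts[label] = counts.get(label, 0) + 1
type_totals = {}
for label, n in counts.items():
    type_totals[label[0]] = type_totals.get(label[0], 0) + n
pi = {label: Fraction(n, type_totals[label[0]]) for label, n in counts.items()}

# Unnormalized tree probabilities and their sum, the normalization constant.
unnormalized = {t: prod(pi[label] for label in labels) for t, labels in TREES.items()}
z = sum(unnormalized.values())
p_norm = {t: p / z for t, p in unnormalized.items()}

def corpus_likelihood(dist):
    """Product of the tree probabilities over the training corpus."""
    return prod(dist[t] for t in CORPUS)

# Alternative distribution over the same two proof trees.
p_alt = {"x1": Fraction(2, 3), "x2": Fraction(1, 3)}
```

Here `corpus_likelihood(p_norm)` evaluates to 16/125 and
`corpus_likelihood(p_alt)` to 4/27, the exact values behind .128 and .148.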



4. A Log-linear Model for Probabilistic CLP

The above-discussed approach based a probability distribution over
proof trees on a definition of the derivation process of CLP as a
(context-free) stochastic process. An alternative, presented by Abney (1996)
for his model of stochastic attribute-value grammars, is to define a
probability distribution over dags as a random field. This probability model
does not build on any underlying stochastic process but rather on the
underlying graphical structure of the analyses produced by the model. Random
fields can be seen as special instances of general log-linear probability
models. Such a model can be defined as follows.

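Definition 4 (log-linear distribution). A log-linear probability
distribution $p_{\lambda \cdot \nu}$ on a set $\Omega$ is defined s.t. for
all $\omega \in \Omega$:

$$p_{\lambda \cdot \nu}(\omega) = Z_{\lambda \cdot \nu}^{-1}\, e^{\lambda \cdot \nu(\omega)}\, p_0(\omega)$$

where
$Z_{\lambda \cdot \nu} = \sum_{\omega \in \Omega} e^{\lambda \cdot \nu(\omega)} p_0(\omega)$
is a normalizing constant,
$\lambda = (\lambda_1, \ldots, \lambda_n)$ is a vector of log-parameters
s.t. $\lambda \in \mathbb{R}^n$,
$\chi = (\chi_1, \ldots, \chi_n)$ is a vector of properties,
$\nu = (\nu_1, \ldots, \nu_n)$ is a vector of property-functions s.t. for
each $\nu_i : \Omega \rightarrow \mathbb{N}$, $\nu_i(\omega)$ is the number
of occurrences of property $\chi_i$ in $\omega$,
$\lambda \cdot \nu(\omega)$ is a weighted property-function s.t.
$\lambda \cdot \nu(\omega) = \sum_{i=1}^n \lambda_i \nu_i(\omega)$,
$p_0$ is a fixed initial distribution.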
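
For a finite set the definition can be evaluated directly. The following
Python sketch is an illustration under that finiteness assumption (the set of
proof trees considered below is countably infinite); the function name
`log_linear` and the toy properties, which count the letters "a" and "b" in
short strings, are hypothetical.

```python
import math

def log_linear(omega, nu, lam, p0):
    """Return {w: p(w)} with p(w) = exp(lam . nu(w)) * p0[w] / Z on a finite omega."""
    weights = {
        w: math.exp(sum(l * n for l, n in zip(lam, nu(w)))) * p0[w]
        for w in omega
    }
    z = sum(weights.values())            # normalizing constant Z
    return {w: weight / z for w, weight in weights.items()}

# Hypothetical toy instance: properties count occurrences of "a" and "b".
omega = ["aa", "ab", "b"]
nu = lambda w: (w.count("a"), w.count("b"))
lam = (math.log(2.0), 0.0)               # each "a" doubles the weight
p0 = {w: 1.0 / len(omega) for w in omega}

p = log_linear(omega, nu, lam, p0)
```

With these values the unnormalized weights are in the ratio 4 : 2 : 1, so the
resulting probabilities are 4/7, 2/7 and 1/7.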
In analogy to stochastic attribute-value grammars, we can define a
probability distribution over proof trees as a special log-linear model.
The special instance of interest is simply a log-linear distribution on the
countably infinite set of proof trees for a set of queries to a program. Such
a distribution is determined by a vector of properties and a corresponding
vector of log-parameters. Properties could be defined, e.g., as subtrees of
proof trees. For the moment, we can leave an exact definition of properties
aside and refer to an assumed vector of property-functions.

The form of log-linear models can be rationalized as an example of an 
exponential family of probability functions. From this viewpoint this 
model can be seen as just a very flexible probability model defining the 
probability of a configuration to be proportional to the product of weights 
assigned to arbitrary properties of the configuration. 

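This can be put in the form of Definition 4 by replacing proportionality
by a constant and parameters $\pi_i$ by log-parameters
$\lambda_i = \log \pi_i$.

$$
\begin{aligned}
p(\omega) &= c \prod_{i=1}^n \pi_i^{\nu_i(\omega)}\, p_0(\omega) \\
          &= Z^{-1} \prod_{i=1}^n e^{\lambda_i \nu_i(\omega)}\, p_0(\omega) \\
          &= Z^{-1} e^{\sum_{i=1}^n \lambda_i \nu_i(\omega)}\, p_0(\omega) \\
          &= Z^{-1} e^{\lambda \cdot \nu(\omega)}\, p_0(\omega)
\end{aligned}
$$
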

Another way to rationalize the form of the log-linear model is as a 
maximum entropy probability distribution. From this viewpoint we do 
statistical inference and, believing that entropy is the unique consistent 
measure of the amount of uncertainty represented by a probability distri- 
bution, we obey the following principle: 

In making inferences on the basis of partial information we 
must use that probability distribution which has maximum 
entropy subject to whatever is known. This is the only unbi- 
ased assignment we can make; to use any other would amount 
to arbitrary assumption of information which by hypothesis 
we do not have. (Jaynes 1957) 

More formally, suppose a random variable $X$ can take on values $x_i$,
$i = 1, \ldots, n$, and we want to estimate the corresponding probabilities
$p_i$, $i = 1, \ldots, n$. All we have are expectations of functions
$f_k(X)$, $k = 1, \ldots, m$. The maximum entropy principle can then be
stated as follows.



PROBABILISTIC CONSTRAINT LOGIC PROGRAMMING 






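Maximize $H(p_1, \ldots, p_n) = -\sum_{i=1}^n p_i \log p_i$ subject to the
constraints $\sum_{i=1}^n p_i f_k(x_i) = F_k$, $k = 1, \ldots, m$, and
$\sum_{i=1}^n p_i = 1$. The solution we get for all $p_i$,
$i = 1, \ldots, n$, is:

$$p_i = \frac{e^{\sum_{k=1}^m \lambda_k f_k(x_i)}}{\sum_{i=1}^n e^{\sum_{k=1}^m \lambda_k f_k(x_i)}}$$

This result follows directly from a constrained optimization argument
where the parameters are viewed as Lagrange multipliers:

Let $\Lambda$ denote the Lagrangian defined by

$$\Lambda(p_1, \ldots, p_n, \lambda_0, \lambda_1, \ldots, \lambda_m) = -\sum_{i=1}^n (p_i \log p_i) + (\lambda_0 + 1)\Big(\sum_{i=1}^n p_i - 1\Big) + \lambda_1 \Big(\sum_{i=1}^n p_i f_1(x_i) - F_1\Big) + \cdots + \lambda_m \Big(\sum_{i=1}^n p_i f_m(x_i) - F_m\Big).$$

Then

$$\frac{\partial \Lambda}{\partial p_i} = -(\log p_i + 1) + (\lambda_0 + 1) + \lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i).$$

Set $\frac{\partial \Lambda}{\partial p_i} = 0$, then
$p_i = e^{\lambda_0 + \sum_{k=1}^m \lambda_k f_k(x_i)}$.

Since $\sum_{i=1}^n p_i = 1$, we have
$e^{\lambda_0} \sum_{i=1}^n e^{\sum_{k=1}^m \lambda_k f_k(x_i)} = 1$.

Define $Z = \sum_{i=1}^n e^{\sum_{k=1}^m \lambda_k f_k(x_i)}$, then
$\lambda_0 = \log Z^{-1}$ and
$p_i = Z^{-1} e^{\sum_{k=1}^m \lambda_k f_k(x_i)}$.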

Log-linear models originated in statistical physics as flexible
probabilistic models of equilibrium states of physical systems. Jaynes
interpreted such Gibbs or Boltzmann distributions in a more abstract
maximum-entropy framework (see Jaynes (1983)). Besides numerous applications
in the area of natural language processing^5, log-linear models are also
applied successfully in image processing (see the work on random fields
initiated by Geman and Geman (1984)) and are closely related to other
probabilistic models such as Boltzmann machines (see Ackley and Hinton
(1985)) or graphical models (see Pearl (1988)).



The work presented in the following sections applies for the most part to
log-linear models in general. We will refer for this discussion to Definition
4. In case the property vector is fixed and clear from the context, the
model will be written $p_\lambda$ to indicate the dependence on the parameter
vector. Furthermore, it will be convenient to have a recursive definition
of models based on property-functions which are extended by additional
properties and corresponding parameters or by new parameters.

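Proposition 1. For each weighted property-function
$\phi(\omega) = \lambda \cdot \nu(\omega)$,
$\psi(\omega) = \gamma \cdot \mu(\omega)$ (with possibly $\nu = \mu$), let
$(\psi + \phi)(\omega) = \psi(\omega) + \phi(\omega)$ be an extended
property-function (reducing to $(\lambda + \gamma) \cdot \nu(\omega)$ in case
$\nu = \mu$). Then
$p_{\psi+\phi}(\omega) = Z_{\psi+\phi}^{-1}\, e^{\psi(\omega)}\, p_\phi(\omega)$
where
$Z_{\psi+\phi} = \sum_{\omega \in \Omega} e^{\psi(\omega)} p_\phi(\omega)$.

Proof. By Definition 4, with $Z_\phi$ the normalizing constant of $p_\phi$,

$$
\begin{aligned}
p_{\psi+\phi}(\omega)
&= \frac{e^{\psi(\omega) + \phi(\omega)}\, p_0(\omega)}{\sum_{\omega \in \Omega} e^{\psi(\omega) + \phi(\omega)}\, p_0(\omega)} \\
&= \frac{e^{\psi(\omega)}\, Z_\phi^{-1} e^{\phi(\omega)}\, p_0(\omega)}{\sum_{\omega \in \Omega} e^{\psi(\omega)}\, Z_\phi^{-1} e^{\phi(\omega)}\, p_0(\omega)} \\
&= \frac{e^{\psi(\omega)}\, p_\phi(\omega)}{\sum_{\omega \in \Omega} e^{\psi(\omega)}\, p_\phi(\omega)}.
\end{aligned}
$$

^5 The applications include, among others, probabilistic grammar models
(Mark, Miller, Grenander, and Abney 1992; Abney 1996), word morphology (Della
Pietra, Della Pietra, and Lafferty 1995), machine translation (Berger, Della
Pietra, and Della Pietra 1996), language modelling (Rosenfeld 1996),
part-of-speech tagging (Ratnaparkhi 1996), word correlations (Beeferman,
Berger, and Lafferty 1997a) and text segmentation (Beeferman, Berger, and
Lafferty 1997b).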
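
The recursive form of Proposition 1 can be checked numerically on a small
finite set. In this Python sketch the helper `reweight`, the four-element
set, and the particular weighted property-functions `phi` and `psi` are
arbitrary choices made for the illustration.

```python
import math

def reweight(score, base):
    """Return the distribution proportional to exp(score(w)) * base[w]."""
    weights = {w: math.exp(score(w)) * p for w, p in base.items()}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}

omega = range(4)
p0 = {w: 0.25 for w in omega}          # fixed initial distribution
phi = lambda w: 0.5 * w                # hypothetical weighted property-function
psi = lambda w: -0.3 * (w % 2)         # hypothetical extension

p_phi = reweight(phi, p0)                                  # model for phi
direct = reweight(lambda w: psi(w) + phi(w), p0)           # model for psi + phi from p0
stepwise = reweight(psi, p_phi)                            # p_phi reweighted by exp(psi)
```

Up to floating-point rounding, `direct` and `stepwise` coincide, which is what
allows a model to be extended by new properties starting from the current
model rather than from the initial distribution.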



5. Inducing Log-linear Models from Incomplete Data 

Induction of log-linear models involves two problems: parameter esti- 
mation and property selection. In the following we will give a detailed 
presentation of solutions to these problems for the case of incomplete 
data. 



5.1. Parameter Estimation from Incomplete Data. An algorithm to estimate
the parameters of general log-linear models from complete data has been
presented by Della Pietra, Della Pietra, and Lafferty (1995). Their "Improved
Iterative Scaling" algorithm is an extension of the "Generalized Iterative
Scaling" algorithm of Darroch and Ratcliff (1972), especially tailored to
estimating models with large parameter spaces. The algorithm is a technique
for maximum likelihood estimation for log-linear models from complete data,
i.e., it addresses the problem of maximizing the complete-data log-likelihood
function $\log \prod_x p(x)^{\tilde{p}(x)}$ for a given empirical
distribution $\tilde{p}(x)$ over complete data $x$. The solution to this
problem is equivalent to the solution to the maximum entropy problem subject
to linear constraints, i.e., the problem of maximizing the entropy $H(p)$
subject to the constraints
$\sum_x p(x) f_k(x) = \sum_x \tilde{p}(x) f_k(x)$, $k = 1, \ldots, m$, with
respect to the complete-data empirical expectation (see Della Pietra, Della
Pietra, and Lafferty (1995)). In the language of constrained optimization,
the maximum likelihood problem for log-linear models with respect to complete
data is the dual of the maximum entropy problem for linear constraints with
respect to complete data (see Berger, Della Pietra, and Della Pietra (1996)).






However, the need to rely on large training samples of complete data 
may be inconvenient if complete data are complex and difficult to gather. 
This is the case for applications of CLP to natural language processing.
Here complete data means several person-years of hand-annotating
large corpora with detailed analyses of specialized grammar frameworks. 
Clearly, for such applications parameter estimation from incomplete data, 
i.e., unanalyzed input of natural language strings, is desirable. 

Unfortunately, Iterative Scaling will no longer work if the training
data are incomplete. The incomplete-data log-likelihood takes the form
$\log \prod_{y} \sum_{x \in X(y)} p_\lambda(x)$, i.e., the probability the
model assigns to the data strings is the product of the probabilities of
the strings, and the probability of a string is calculated as the sum of the
probabilities of its analyses. In contrast to the complete-data
log-likelihood this function is non-concave (it involves a sum inside the
logarithm) and cannot be maximized directly or uniquely.

In the following we will show how the numerical algorithm of Della
Pietra, Della Pietra, and Lafferty (1995) can be redefined in order to fit
incomplete data. The new algorithm can be defined in the EM-framework
of maximum likelihood estimation from incomplete data of Dempster,
Laird, and Rubin (1977). Applying this framework to the problem of
probabilistic CLP, we can assume the following to be given:

• Observed, incomplete data $y \in \mathcal{Y}$ corresponding to a given,
finite set of queries for a constraint logic program $\mathcal{P}$,

• Unobserved, complete data $x \in \mathcal{X}$ corresponding to the
countably infinite set of proof trees for queries $\mathcal{Y}$ from a
constraint logic program $\mathcal{P}$,

• Functions $Y : \mathcal{X} \to \mathcal{Y}$ s.t. $Y(x) = y$ corresponds to
the unique query labeling proof tree $x$, and
$X : \mathcal{Y} \to 2^{\mathcal{X}}$ s.t. $X(y) = \{x \mid Y(x) = y\}$ is
the countably infinite set of proof trees for query $y$ from a constraint
logic program $\mathcal{P}$,

• Complete data specifications $p_\lambda$ s.t. $p_\lambda(x)$ is a
log-linear distribution on $\mathcal{X}$ with given initial distribution
$p_0$, fixed property vector $\chi$ and property-function vector $\nu$, and
depending on parameter vector $\lambda$,

• Incomplete data specifications $L$ s.t.
$L(\lambda) = \log \prod_{y \in \mathcal{Y}} \sum_{x \in X(y)} p_\lambda(x)
= \sum_{y \in \mathcal{Y}} \log \sum_{x \in X(y)} p_\lambda(x)
= \sum_{y \in \mathcal{Y}} \log p_\lambda(y)$ is the log-likelihood of
a fixed $\mathcal{Y}$-sample depending on parameter vector $\lambda$.
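
For a small finite case the quantities listed above can be written down directly. The following Python sketch is only an illustration under simplifying assumptions: the set of proof trees is finite, proof trees are identified with list indices, and the names `log_linear`, `incomplete_log_likelihood`, `query_of` and the toy data are hypothetical, not part of the original model description.

```python
import math

def log_linear(lam, nu, p0):
    """Return p_lam(x) = exp(lam . nu(x)) * p0(x) / Z for every proof tree x."""
    weights = [p0[x] * math.exp(sum(l * v for l, v in zip(lam, nu[x])))
               for x in range(len(nu))]
    z = sum(weights)  # normalizing constant Z
    return [w / z for w in weights]

def incomplete_log_likelihood(lam, nu, p0, query_of, sample):
    """L(lam) = sum over observed queries y of log sum_{x in X(y)} p_lam(x)."""
    p = log_linear(lam, nu, p0)
    total = 0.0
    for y in sample:
        # X(y): all proof trees x whose labeling query Y(x) equals y
        p_y = sum(p[x] for x in range(len(p)) if query_of[x] == y)
        total += math.log(p_y)
    return total

# Hypothetical toy data: three proof trees and a single property function.
nu = [(1,), (0,), (1,)]          # nu(x) for x = 0, 1, 2
p0 = [1 / 3, 1 / 3, 1 / 3]       # initial distribution p_0
query_of = ["y1", "y1", "y2"]    # Y(x): the query labeling each proof tree
sample = ["y1", "y2"]            # observed, unanalyzed queries

L0 = incomplete_log_likelihood([0.0], nu, p0, query_of, sample)
```

With the parameter set to zero the model coincides with the uniform initial distribution, so `L0` is the sum of the logarithms of the two query probabilities 2/3 and 1/3.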

For the discussion of parameter estimation we will refer to a given
vector of property functions. This is assumed to result from the
property selection procedure defined in Sect. 5.2, whereby for each property
function $\nu_i$ some proof tree $x \in \mathcal{X}$ s.t. $\nu_i(x) > 0$ is
assumed to exist. Furthermore, we require $p_\lambda$ to be strictly positive
on $\mathcal{X}$, i.e., $p_\lambda(x) > 0$ for all $x \in \mathcal{X}$.

The problem of maximum likelihood estimation of log-linear models 
from incomplete data can then be stated formally as follows. 

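Given a fixed $\mathcal{Y}$-sample and a set
$\Lambda = \{\lambda \mid p_\lambda(x)$ is a log-linear distribution on
$\mathcal{X}$ with fixed $p_0$, fixed $\nu$ and $\lambda \in \mathbb{R}^n\}$,
we want to find the maximum likelihood estimate $\lambda^* \in \Lambda$ s.t.
$\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$.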

The key idea of the following approach is to iteratively maximize a
strictly concave auxiliary function when the log-likelihood objective
function cannot be maximized analytically. An auxiliary function convenient
for our problem can be defined as a two-place function $A$ giving an
estimate of the improvement in the incomplete-data log-likelihood $L$ when
going from a model $p_\lambda$ to a model $p_{\gamma+\lambda}$.

In the following $p[f] = \sum_{\omega \in \Omega} p(\omega) f(\omega)$ will
denote the expectation of a function $f : \Omega \to \mathbb{R}$ with respect
to a probability distribution $p$ on a set $\Omega$.

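Definition 5. Let $\lambda \in \Lambda$, $\gamma \in \mathbb{R}^n$. Then

$$A(\gamma + \lambda) = \sum_{y \in \mathcal{Y}} \Big(1 + k_\lambda[\gamma \cdot \nu] - p_\lambda\Big[\sum_{i=1}^{n} \bar\nu_i\, e^{\gamma_i \nu_\#}\Big]\Big)$$

where $\bar\nu_i(x) = \frac{\nu_i(x)}{\nu_\#(x)}$,
$\nu_\#(x) = \sum_{i=1}^{n} \nu_i(x)$,
$k_\lambda(x) = \frac{p_\lambda(x)}{\sum_{x \in X(y)} p_\lambda(x)}$.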

By considering the first and second derivative of $A$, we see that $A$ is
strictly concave in the parameters. Strict concavity together with
continuity of the function and closedness of the parameter space directly
gives us a unique maximum of $A$.

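Proposition 2. For each $\lambda \in \Lambda$, $\gamma \in \mathbb{R}^n$:
$A(\gamma + \lambda)$ takes its maximum as a function of $\gamma$ at the
unique point $\hat\gamma$ satisfying for each $\hat\gamma_i$, $i = 1, \ldots, n$:

$$\sum_{y \in \mathcal{Y}} k_\lambda[\nu_i] = \sum_{y \in \mathcal{Y}} p_\lambda\big[\nu_i\, e^{\hat\gamma_i \nu_\#}\big].$$

Proof.

$$\frac{\partial}{\partial \gamma_i} A(\gamma + \lambda)
= \sum_{y \in \mathcal{Y}} \Big(k_\lambda[\nu_i] - p_\lambda\big[\bar\nu_i\, \nu_\#\, e^{\gamma_i \nu_\#}\big]\Big)
= \sum_{y \in \mathcal{Y}} \Big(k_\lambda[\nu_i] - p_\lambda\big[\nu_i\, e^{\gamma_i \nu_\#}\big]\Big),$$

$$\frac{\partial^2}{\partial \gamma_i\, \partial \gamma_j} A(\gamma + \lambda) = 0 \quad \text{for } j \neq i,$$

$$\frac{\partial^2}{\partial \gamma_i^2} A(\gamma + \lambda)
= - \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p_\lambda(x)\, \nu_i(x)\, \nu_\#(x)\, e^{\gamma_i \nu_\#(x)}
= - \sum_{y \in \mathcal{Y}} p_\lambda\big[\nu_i\, \nu_\#\, e^{\gamma_i \nu_\#}\big]
< 0. \qquad \Box$$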

At the core of the proposed method lies the definition of an iterative
algorithm for maximizing $L$ which is constructed from the auxiliary
function $A$. At each step of this "Iterative Maximization (IM)" algorithm a
model based on parameters $\lambda$ is extended by a parameter vector
$\hat\gamma$ which gives the maximum estimated improvement in log-likelihood
$L$, i.e., which is obtained by maximizing the auxiliary function
$A(\gamma + \lambda)$ as a function of $\gamma$.

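Definition 6 (iterative maximization). Let
$\mathcal{M} : \Lambda \to \Lambda$ be a mapping defined by

$$\mathcal{M}(\lambda) = \hat\lambda \in \Lambda \ \text{ s.t. } \ \hat\lambda = \hat\gamma + \lambda \ \text{ with } \ \hat\gamma = \arg\max_{\gamma \in \mathbb{R}^n} A(\gamma + \lambda).$$

Then each step of the Iterative Maximization Algorithm is defined by

$$\lambda^{(k+1)} = \mathcal{M}(\lambda^{(k)}).$$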
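
To make the iteration concrete, the following Python sketch runs the IM step on a small finite toy problem. It is a minimal illustration under stated assumptions, not the author's implementation: the set of proof trees is finite, each $\hat\gamma_i$ is found by bisection on the stationarity condition $\sum_y k_\lambda[\nu_i] = N\, p_\lambda[\nu_i e^{\gamma_i \nu_\#}]$ (which follows from differentiating $A$), and all names (`im_step`, `query_of`, the toy data) are hypothetical.

```python
import math

def log_linear(lam, nu, p0):
    """p_lam(x) proportional to exp(lam . nu(x)) * p0(x) on a finite set of proof trees."""
    w = [p0[x] * math.exp(sum(l * v for l, v in zip(lam, nu[x])))
         for x in range(len(nu))]
    z = sum(w)
    return [wi / z for wi in w]

def log_likelihood(lam, nu, p0, query_of, sample):
    """Incomplete-data log-likelihood L(lam)."""
    p = log_linear(lam, nu, p0)
    return sum(math.log(sum(p[x] for x in range(len(p)) if query_of[x] == y))
               for y in sample)

def im_step(lam, nu, p0, query_of, sample):
    """One step lam -> gamma_hat + lam, all gamma_i computed from the same p_lam."""
    p = log_linear(lam, nu, p0)
    nu_sum = [sum(v) for v in nu]            # nu_#(x)
    new_lam = list(lam)
    for i in range(len(lam)):
        # Left-hand side: sum over the sample of k_lam[nu_i].
        lhs = 0.0
        for y in sample:
            xs = [x for x in range(len(p)) if query_of[x] == y]
            p_y = sum(p[x] for x in xs)
            lhs += sum(p[x] / p_y * nu[x][i] for x in xs)
        if lhs <= 0.0:
            continue  # assumption of this sketch: leave such a parameter unchanged

        def rhs(g, i=i):
            # N * p_lam[nu_i * exp(g * nu_#)], strictly increasing in g
            return len(sample) * sum(p[x] * nu[x][i] * math.exp(g * nu_sum[x])
                                     for x in range(len(p)))

        lo, hi = -1.0, 1.0
        while rhs(lo) > lhs:                 # widen the bracket to the left
            lo *= 2.0
        while rhs(hi) < lhs:                 # widen the bracket to the right
            hi *= 2.0
        for _ in range(200):                 # bisection on rhs(g) = lhs
            mid = 0.5 * (lo + hi)
            if rhs(mid) < lhs:
                lo = mid
            else:
                hi = mid
        new_lam[i] = lam[i] + 0.5 * (lo + hi)
    return new_lam

# Hypothetical toy data: four proof trees, two property functions, two queries.
nu = [(1, 0), (0, 1), (2, 0), (1, 1)]
p0 = [0.25] * 4
query_of = ["y1", "y1", "y2", "y2"]
sample = ["y1", "y1", "y2"]

lam = [0.0, 0.0]
trace = [log_likelihood(lam, nu, p0, query_of, sample)]
for _ in range(20):
    lam = im_step(lam, nu, p0, query_of, sample)
    trace.append(log_likelihood(lam, nu, p0, query_of, sample))
```

The list `trace` records the incomplete-data log-likelihood after each step, so the hill-climbing behaviour can be inspected directly. Bisection is used here only for robustness in a toy setting.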

To show the central convergence properties of the IM algorithm, we first
have to show some provisional results. Lemma 3 shows that the auxiliary
function $A(\gamma + \lambda)$ is a lower bound on
$L(\gamma + \lambda) - L(\lambda)$, the difference in log-likelihood between
the basic and the extended model, i.e., it is a conservative estimate of the
improvement in log-likelihood.
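Lemma 3. $A(\gamma + \lambda) \le L(\gamma + \lambda) - L(\lambda)$.

Proof.

$$L(\gamma + \lambda) - L(\lambda)
= \sum_{y \in \mathcal{Y}} \log \frac{\sum_{x \in X(y)} p_{\gamma+\lambda}(x)}{\sum_{x \in X(y)} p_\lambda(x)}
= \sum_{y \in \mathcal{Y}} \log \sum_{x \in X(y)} k_\lambda(x)\, \frac{p_{\gamma+\lambda}(x)}{p_\lambda(x)}$$

$$\ge \sum_{y \in \mathcal{Y}} \sum_{x \in X(y)} k_\lambda(x) \log \frac{p_{\gamma+\lambda}(x)}{p_\lambda(x)} \qquad \text{by Jensen's inequality}$$

$$= \sum_{y \in \mathcal{Y}} \sum_{x \in X(y)} k_\lambda(x) \big(\log p_{\gamma+\lambda}(x) - \log p_\lambda(x)\big)$$

$$= \sum_{y \in \mathcal{Y}} \sum_{x \in X(y)} k_\lambda(x) \big(\gamma \cdot \nu(x) - \log p_\lambda[e^{\gamma \cdot \nu}] + \log p_\lambda(x) - \log p_\lambda(x)\big)$$

$$= \sum_{y \in \mathcal{Y}} \big(k_\lambda[\gamma \cdot \nu] - \log p_\lambda[e^{\gamma \cdot \nu}]\big)$$

$$\ge \sum_{y \in \mathcal{Y}} \big(k_\lambda[\gamma \cdot \nu] + 1 - p_\lambda[e^{\gamma \cdot \nu}]\big) \qquad \text{since } \log x \le x - 1$$

$$= \sum_{y \in \mathcal{Y}} \Big(k_\lambda[\gamma \cdot \nu] + 1 - \sum_{x \in \mathcal{X}} p_\lambda(x)\, e^{\sum_{i=1}^{n} \bar\nu_i(x)\, \gamma_i\, \nu_\#(x)}\Big)$$

$$\ge \sum_{y \in \mathcal{Y}} \Big(k_\lambda[\gamma \cdot \nu] + 1 - \sum_{x \in \mathcal{X}} p_\lambda(x) \sum_{i=1}^{n} \bar\nu_i(x)\, e^{\gamma_i \nu_\#(x)}\Big) \qquad \text{by Jensen's inequality}$$

$$= \sum_{y \in \mathcal{Y}} \Big(1 + k_\lambda[\gamma \cdot \nu] - p_\lambda\Big[\sum_{i=1}^{n} \bar\nu_i\, e^{\gamma_i \nu_\#}\Big]\Big)
= A(\gamma + \lambda). \qquad \Box$$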
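Lemma 4 shows that there is no estimated improvement in log-likelihood
at the origin.

Lemma 4. $A(0 + \lambda) = 0$.

Proof.

$$A(0 + \lambda) = \sum_{y \in \mathcal{Y}} \Big(k_\lambda[0 \cdot \nu] + 1 - \sum_{x \in \mathcal{X}} p_\lambda(x) \sum_{i=1}^{n} \bar\nu_i(x)\, e^{0}\Big) = 0. \qquad \Box$$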

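Lemma 5 shows that the critical points of interest are the same for $A$
and $L$.

Lemma 5. $\frac{d}{dt}\big|_{t=0} A(t\gamma + \lambda) = \frac{d}{dt}\big|_{t=0} L(t\gamma + \lambda)$.

Proof.

$$\frac{d}{dt} A(t\gamma + \lambda)
= \frac{d}{dt} \sum_{y \in \mathcal{Y}} \Big(k_\lambda[t\gamma \cdot \nu] + 1 - \sum_{x \in \mathcal{X}} p_\lambda(x) \sum_{i=1}^{n} \bar\nu_i(x)\, e^{t \gamma_i \nu_\#(x)}\Big)$$

$$= \sum_{y \in \mathcal{Y}} \Big(k_\lambda[\gamma \cdot \nu] - \sum_{x \in \mathcal{X}} p_\lambda(x) \sum_{i=1}^{n} \gamma_i\, \nu_i(x)\, e^{t \gamma_i \nu_\#(x)}\Big),$$

$$\frac{d}{dt}\Big|_{t=0} A(t\gamma + \lambda)
= \sum_{y \in \mathcal{Y}} \Big(k_\lambda[\gamma \cdot \nu] - \sum_{x \in \mathcal{X}} p_\lambda(x) \sum_{i=1}^{n} \gamma_i\, \nu_i(x)\, e^{0}\Big)
= \sum_{y \in \mathcal{Y}} \big(k_\lambda[\gamma \cdot \nu] - p_\lambda[\gamma \cdot \nu]\big).$$

$$\frac{d}{dt} L(t\gamma + \lambda)
= \frac{d}{dt} \sum_{y \in \mathcal{Y}} \log \sum_{x \in X(y)} p_{t\gamma+\lambda}(x)$$

$$= \sum_{y \in \mathcal{Y}} \Big(\sum_{x \in X(y)} p_{t\gamma+\lambda}(x)\Big)^{-1} \frac{d}{dt} \sum_{x \in X(y)} e^{t\gamma \cdot \nu(x)}\, p_\lambda(x)\, Z_{t\gamma+\lambda}^{-1}$$

$$= \sum_{y \in \mathcal{Y}} \Big(\sum_{x \in X(y)} p_{t\gamma+\lambda}(x)\Big)^{-1} \Big(\sum_{x \in X(y)} p_{t\gamma+\lambda}(x)\, \gamma \cdot \nu(x) - \sum_{x \in X(y)} p_{t\gamma+\lambda}(x) \sum_{x \in \mathcal{X}} p_{t\gamma+\lambda}(x)\, \gamma \cdot \nu(x)\Big)$$

$$= \sum_{y \in \mathcal{Y}} \big(k_{t\gamma+\lambda}[\gamma \cdot \nu] - p_{t\gamma+\lambda}[\gamma \cdot \nu]\big),$$

$$\frac{d}{dt}\Big|_{t=0} L(t\gamma + \lambda)
= \sum_{y \in \mathcal{Y}} \big(k_\lambda[\gamma \cdot \nu] - p_\lambda[\gamma \cdot \nu]\big). \qquad \Box$$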

One central result of this section is stated in Theorem 6. It shows
the hill-climbing nature of the IM algorithm, i.e., the log-likelihood L is 
increasing on each iteration of the IM algorithm except at fixed points of 
M or equivalently at critical points of L. 

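Theorem 6. For all $\lambda \in \Lambda$:
$L(\mathcal{M}(\lambda)) \ge L(\lambda)$ with equality iff $\lambda$ is a fixed
point of $\mathcal{M}$ or equivalently is a critical point of $L$.

Proof.

$$L(\mathcal{M}(\lambda)) - L(\lambda) \ge A(\mathcal{M}(\lambda)) \qquad \text{by Lemma 3}$$

$$\ge 0 \qquad \text{by Lemma 4 and definition of } \mathcal{M}.$$

The equality $L(\mathcal{M}(\lambda)) = L(\lambda)$ holds iff $\lambda$ is a
fixed point of $\mathcal{M}$, i.e., $\mathcal{M}(\lambda) = \hat\gamma + \lambda$
with $\hat\gamma = 0$. Furthermore, $\lambda$ is a fixed point of
$\mathcal{M}$ iff

$\hat\gamma = \arg\max_{\gamma \in \mathbb{R}^n} A(\gamma + \lambda) = 0$,

$\Leftrightarrow$ for all $\gamma \in \mathbb{R}^n$: $\hat t = \arg\max_{t \in \mathbb{R}} A(t\gamma + \lambda) = 0$,

$\Leftrightarrow$ for all $\gamma \in \mathbb{R}^n$: $\frac{d}{dt}\big|_{t=0} A(t\gamma + \lambda) = 0$,

$\Leftrightarrow$ for all $\gamma \in \mathbb{R}^n$: $\frac{d}{dt}\big|_{t=0} L(t\gamma + \lambda) = 0$, by Lemma 5

$\Leftrightarrow$ $\lambda$ is a critical point of $L$. $\Box$

Corollary 7 implies that a maximum likelihood estimate is a fixed point
of the mapping $\mathcal{M}$.

Corollary 7. Let $\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$.
Then $\lambda^*$ is a fixed point of $\mathcal{M}$.

Proposition 8 discusses the convergence properties of the algorithm. As
with each application of the EM algorithm, we can show convergence of
the IM algorithm to critical points of the incomplete-data log-likelihood
function $L$. This means that the limiting parameter value can occur at a
local, not only at a global maximum of $L$. This chaotic behaviour of the
algorithm, i.e., the dependence of convergence on initial parameter values,
must be treated as an empirical matter.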
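Proposition 8. Let $\{\lambda^{(k)}\}$ be a sequence in $\Lambda$ determined
by the IM Algorithm. Then all limit points of $\{\lambda^{(k)}\}$ are fixed
points of $\mathcal{M}$ or equivalently are critical points of $L$.

Proof. Let $\{\lambda^{(k_n)}\}$ be a subsequence of $\{\lambda^{(k)}\}$
converging to $\bar\lambda$. Then for all $\gamma \in \mathbb{R}^n$:

$$A(\gamma + \lambda^{(k_n)}) \le A(\hat\gamma^{(k_n)} + \lambda^{(k_n)}) \qquad \text{by definition of } \hat\gamma^{(k_n)}$$

$$\le L(\hat\gamma^{(k_n)} + \lambda^{(k_n)}) - L(\lambda^{(k_n)}) \qquad \text{by Lemma 3}$$

$$= L(\lambda^{(k_n + 1)}) - L(\lambda^{(k_n)}) \qquad \text{by definition of IM}$$

$$\le L(\lambda^{(k_{n+1})}) - L(\lambda^{(k_n)}) \qquad \text{since } L \text{ is non-decreasing along the sequence (Theorem 6)}$$

and in the limit as $n \to \infty$ for continuous $A$ and $L$:
$A(\gamma + \bar\lambda) \le L(\bar\lambda) - L(\bar\lambda) = 0$.
Thus $\gamma = 0$ is a maximum of $A(\gamma + \bar\lambda)$, using Lemma 4,
and $\bar\lambda$ is a fixed point of $\mathcal{M}$. Furthermore,
$\frac{d}{dt}\big|_{t=0} A(t\gamma + \bar\lambda) = \frac{d}{dt}\big|_{t=0} L(t\gamma + \bar\lambda) = 0$,
using Lemma 5, and $\bar\lambda$ is a critical point of $L$. $\Box$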

5.2. Property Selection from Incomplete Data. For the preceding 
task of parameter estimation we assumed a vector of properties to be 
given. However, exhaustive sets of properties can get unmanageably large 
for most applications. Let us consider the application of probabilistic CLP: 
One possible definition of properties of proof trees is as subtrees of proof 
trees. If we want to be as flexible as possible in the definition of subtree- 
properties and define a subtree of a proof tree to be an arbitrary subgraph 
of a proof tree, then the number of subtrees will grow exponentially in 
the number of proof tree nodes. Clearly, the set of candidate properties 
must be restricted by some quality measure. 

Property selection addresses two general issues. First, selecting promi- 
nent properties out of a set of possible properties can be seen as inducing 
a proper model that captures only the salient properties of the training 
data. This is one of the main tasks of statistical machine learning. Sec- 
ond, compact models will disallow overfitting the training data as could 
be done with models with one parameter per training element. Instead, 
a proper model will allow generalizations to new data and temper the 
overtraining problem. 

Depending on the definition of properties (for the CLP application, 
e.g., as connected subgraphs of proof trees s.t. each node of a subgraph 
has either zero descendants or the same number of descendants as the cor- 
responding node of the supergraph and the node sets of the subgraphs do 
not intersect) and the definition of a procedure to incrementally construct 
properties (e.g., by selecting from an initial set of query-node properties 
and from properties built by performing one-step resolutions at terminal
nodes of subtree-properties of the model), we start from a set of candidate 
properties for a log-linear model. 

But what should the above-mentioned quality measure be? We could
take as the measure the improvement in log-likelihood when extending a
model $p_\lambda$ based upon weighted property function
$\phi = \lambda \cdot \nu$ by a single candidate property $c$ with parameter
$\alpha$ to a model $p_{\alpha+\lambda}$ based on extended property function
$\alpha c + \phi$. In its basic form this quality measure would
require a calculation of maximum likelihood estimates of extended models
via the IM algorithm for each candidate property. Clearly, this is not
feasible for models with large parameter spaces. Following Della Pietra,
Della Pietra, and Lafferty (1995) or Berger, Della Pietra, and Della Pietra
(1996), we could instead approximate the improvement due to adding a
single property by adjusting only the parameter of this candidate and
holding all other parameters of the model fixed. This would make the
property selection algorithm practical but also greedy. Unfortunately, in
contrast to this approach, we cannot directly maximize the gain of adding
property $c$ as a function of parameter $\alpha$ since the incomplete-data
log-likelihood $L$ is not concave in the parameters. However, we can define an
auxiliary function similar to the one used in parameter estimation to
express an approximate gain as a conservative estimate of the log-likelihood
difference. A possible definition of an approximate gain can be derived
from an instantiation of the auxiliary function $A$ of Sect. 5.1 to
$A(\alpha + \lambda)$, denoting the extension of a log-linear model
$p_\lambda$ with property-function vector $\nu$ by a single property $c$
with log-parameter $\alpha$.

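Definition 7. Let $\phi = \lambda \cdot \nu$ be a weighted property function,
$c$ be a candidate property, and $\alpha \in \mathbb{R}$ the log-parameter
corresponding to $c$. Then the approximate gain $G_c(\alpha + \lambda)$ of
adding candidate property $c$ with parameter value $\alpha$ to the log-linear
model $p_\lambda$ is defined s.t.

$$G_c(\alpha + \lambda) = \sum_{y \in \mathcal{Y}} \big(1 + k_\lambda[\alpha c] - p_\lambda[e^{\alpha c}]\big)$$

where $k_\lambda(x) = \frac{p_\lambda(x)}{\sum_{x \in X(y)} p_\lambda(x)}$,
$p_\lambda(x) = Z_\lambda^{-1} e^{\lambda \cdot \nu(x)} p_0(x)$.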

For this function similar properties hold as for the auxiliary function
$A$ of Sect. 5.1. Since $G_c$ is strictly concave in the parameters, we can
maximize it directly and uniquely as a function of $\alpha$.

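Proposition 9. For each $\lambda \in \Lambda$, $\alpha \in \mathbb{R}$:
$G_c(\alpha + \lambda)$ takes its maximum as a function of $\alpha$ at the
unique point $\hat\alpha$ satisfying

$$\sum_{y \in \mathcal{Y}} k_\lambda[c] = \sum_{y \in \mathcal{Y}} p_\lambda\big[c\, e^{\hat\alpha c}\big].$$

Proof. $\frac{\partial}{\partial \alpha} G_c(\alpha + \lambda) = \sum_{y \in \mathcal{Y}} \big(k_\lambda[c] - p_\lambda[c\, e^{\alpha c}]\big)$,
$\frac{\partial^2}{\partial \alpha^2} G_c(\alpha + \lambda) = - \sum_{y \in \mathcal{Y}} p_\lambda[c^2 e^{\alpha c}] < 0$. $\Box$

Footnote: In the following we will refer to the property corresponding to
property function $c$ as the "property $c$".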

Property selection then will incorporate that property out of the set of 
candidates that gives greatest improvement to the model at the property's 
best adjusted parameter value. Since we are interested only in relative, not 
absolute gains, a single, non-iterative maximization of the approximate 
gain will suffice to choose from the candidates. 

Definition 8 (property selection). Let $C$ be a set of candidate
properties, $c \in C$ be a candidate property with log-parameter
$\alpha \in \mathbb{R}$, and
$G_c(\lambda) = \max_\alpha G_c(\alpha + \lambda)$ the maximal approximate
gain that property $c$ can give to model $p_\lambda$. Then $c$ is selected in
a property selection step for model $p_\lambda$ if
$c = \arg\max_{c' \in C} G_{c'}(\lambda)$.
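
A property selection step in the sense of Definitions 7 and 8 can be sketched in Python as follows. The sketch makes assumptions that go beyond the text: the set of proof trees is finite, candidate properties are 0/1-valued (so that the stationarity condition $\sum_y k_\lambda[c] = N\, p_\lambda[c\, e^{\alpha c}]$ has the closed-form solution $e^{\hat\alpha} = \sum_y k_\lambda[c] / (N\, p_\lambda[c])$), and candidates that never fire on a proof tree of an observed query are simply skipped. All function names and the toy data are hypothetical.

```python
import math

def gain(alpha, c, p, query_of, sample):
    """Approximate gain G_c(alpha + lambda) = sum_y (1 + k[alpha*c] - p[exp(alpha*c)])."""
    n = len(p)
    expect_exp = sum(p[x] * math.exp(alpha * c[x]) for x in range(n))
    total = 0.0
    for y in sample:
        xs = [x for x in range(n) if query_of[x] == y]
        p_y = sum(p[x] for x in xs)
        k_c = sum(p[x] / p_y * c[x] for x in xs)   # k_lambda[c] for query y
        total += 1.0 + alpha * k_c - expect_exp
    return total

def best_alpha_binary(c, p, query_of, sample):
    """Closed-form maximizer of the gain for a 0/1-valued candidate (assumption)."""
    n = len(p)
    k_sum = 0.0
    for y in sample:
        xs = [x for x in range(n) if query_of[x] == y]
        p_y = sum(p[x] for x in xs)
        k_sum += sum(p[x] / p_y * c[x] for x in xs)
    q = sum(p[x] * c[x] for x in range(n))         # p_lambda[c]
    if k_sum <= 0.0 or q <= 0.0:
        return None                                # no finite maximizer; skipped
    return math.log(k_sum / (len(sample) * q))

def select_property(candidates, p, query_of, sample):
    """Return (name, alpha_hat, gain) of the candidate with maximal approximate gain."""
    best = None
    for name, c in candidates.items():
        a = best_alpha_binary(c, p, query_of, sample)
        if a is None:
            continue
        g = gain(a, c, p, query_of, sample)
        if best is None or g > best[2]:
            best = (name, a, g)
    return best

# Hypothetical toy data: current model p_lambda is uniform over four proof trees.
p = [0.25] * 4
query_of = ["y1", "y1", "y2", "y2"]
sample = ["y1", "y1", "y2"]
candidates = {
    "a": [1, 1, 0, 0],   # fires exactly on the proof trees of query y1
    "b": [1, 0, 1, 0],   # fires equally often under both queries
}
selected = select_property(candidates, p, query_of, sample)
```

In this toy setting candidate "b" cannot change the probabilities of the observed queries and receives gain zero, whereas candidate "a" shifts probability mass towards the more frequent query and is selected.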

5.3. Summary. The combined incomplete-data induction algorithm for
log-linear models can be summarized as follows.

Input: Initial model $p_0$, incomplete data set $\mathcal{Y}$.

Output: Log-linear model $p^*$ on complete data set
$\mathcal{X} = \bigcup_{y \in \mathcal{Y}} X(y)$
with selected property function vector $\nu^*$ and log-parameter vector
$\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$ where
$\Lambda = \{\lambda \mid p_\lambda$ is a log-linear model on $\mathcal{X}$
based on $p_0$, $\nu^*$ and $\lambda \in \mathbb{R}^n\}$.

Algorithm:

1. $p^{(0)} = p_0$,

2. Property selection: For each candidate property $c \in C^{(n)}$,
compute the gain
$G_c(\lambda^{(n)}) = \max_{\alpha \in \mathbb{R}} G_c(\alpha + \lambda^{(n)})$
and select property $\hat c = \arg\max_{c \in C^{(n)}} G_c(\lambda^{(n)})$.

3. Parameter estimation: Compute the maximum likelihood parameter value
$\hat\lambda = \arg\max_{\lambda \in \Lambda} L(\lambda)$ where
$\Lambda = \{\lambda \mid p_\lambda(x)$ is a log-linear distribution on
$\mathcal{X}$ with initial model $p_0$, property function vector
$\hat\nu = \nu^{(n)} \cup \hat c$, and $\lambda \in \mathbb{R}^n\}$.

4. Set $p^{(n+1)} = p_{\hat\lambda \cdot \hat\nu}$, $n = n + 1$, go to 2.

Returning to the sample program of Fig. 0, we can find a simple
log-linear reformulation of the probabilistic CLP model as follows. In order
to distinguish between the possible proof trees of Fig. || it is sufficient to
define a single property referring to the variable binding either to a or
to b. Taking a parameter value of log 2 for a single property involving
the variable binding to a will yield the desired probability distribution
$p(x_1) = 2/3$, $p(x_2) = 1/3$ and incomplete-data log-likelihood L = .148.
The same result is obtained by taking a parameter value of log 1/2 for a
single property involving the variable binding to b. All other properties
will be unable to distinguish between proof trees $x_1$ and $x_2$ and thus give
a uniform distribution over the proof trees and log-likelihood L = .125.



6. Approximation Methods 

With the algorithms and proofs of the preceding section at hand, in- 
duction of log-linear models from incomplete data reduces to a calcula- 
tion of expectations of simple functions. This calculation can be done by 
an explicit summation over the configuration space only for probabilistic 
processing models with a small, finite set of possible analyses. In case of 
large or infinite configuration spaces and complex parameter spaces these 
expectations can get intractable both analytically and numerically. Here 
approximation methods have to be used. 



Following Della Pietra, Della Pietra, and Lafferty (1995) and Abney
(1996), we can use a combination of the approximation techniques of
Newton's method and Monte Carlo methods. In order to give a self-contained
recipe for inducing log-linear models from incomplete data, we will make
the proposed use of these methods explicit in the following.

Newton's method is a technique to approximate the solution $\alpha$ of an
equation $f(\alpha) = 0$ by using a sequence of linearizations of $f$. At each
step the intersection of the tangent to $f$ at $\alpha_t$ with the
$\alpha$-axis is taken, yielding an improved estimate $\alpha_{t+1}$. The
iteration formulae to approach the solution up to a desired accuracy are
defined as follows:

$$\alpha_{t+1} = \alpha_t - \frac{f(\alpha_t)}{f'(\alpha_t)} \quad \text{where } f'(\alpha_t) \text{ is the derivative of } f \text{ at } \alpha_t.$$

This method directly suits our application when we replace $f(\alpha)$ by
the first derivative of the auxiliary function $A$,
$\frac{\partial}{\partial \gamma_i} A(\gamma + \lambda)$, in case of
parameter estimation, and by the first derivative of the approximate gain
$G_c$, $\frac{\partial}{\partial \alpha} G_c(\alpha + \lambda)$, in case of
property selection. Newton's method usually converges rapidly for such
functions.
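
The iteration can be illustrated with a few lines of Python. The sketch below applies Newton's method to the first derivative of the approximate gain for a 0/1-valued candidate property, for which $f(\alpha) = K - N q e^{\alpha}$ with $K = \sum_y k_\lambda[c]$ and $q = p_\lambda[c]$. The restriction to a binary property, the function name `newton` and the numeric values of `K`, `N` and `q` are assumptions made for this illustration only.

```python
import math

def newton(f, f_prime, a0, tol=1e-12, max_iter=50):
    """Iterate a_{t+1} = a_t - f(a_t) / f'(a_t) until the step is below tol."""
    a = a0
    for _ in range(max_iter):
        step = f(a) / f_prime(a)
        a -= step
        if abs(step) < tol:
            break
    return a

# Hypothetical values: K = sum_y k_lambda[c], N = sample size, q = p_lambda[c].
K, N, q = 2.0, 3, 0.5

def f(a):
    """First derivative of the approximate gain for a 0/1-valued property c."""
    return K - N * q * math.exp(a)

def f_prime(a):
    """Second derivative of the approximate gain (always negative)."""
    return -N * q * math.exp(a)

alpha_hat = newton(f, f_prime, 0.0)
```

For a binary property the root is also available in closed form, $\hat\alpha = \log(K / (N q))$, which gives a simple check on the iteration. For properties taking several values no such closed form exists and the iteration is needed.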

The expectations expressed in the Newton formulae defined in this way can
then be estimated by Monte Carlo methods. A Monte Carlo technique
applicable to our problem is the Metropolis-Hastings method. The strategy
behind this method is to generate a random sample from a target
distribution $p$ via choosing a nominating matrix $p'$ from which sampling is
easy and performing a Bernoulli trial with parameter $\alpha$ to determine
whether to accept or reject the nominated sample point. That means, this
method converts a sampler for $p'$ into a sampler for $p$ via the evaluation
matrix $\alpha$. For our application, we can take as nominating matrix for
each query $y \in \mathcal{Y}$ a stochastic context-free CLP model
$p(x; \pi)$ on $X(y)$ as defined in Sect. ^. From this stochastic derivation
model sampling is easy and can be converted by a standard evaluation matrix
to sampling from the desired log-linear distribution $p_\lambda(x)$ on
$X(y)$. More formally, it can be shown that the distribution of the sampled
random variables $X_i$ will converge to the target distribution $p_\lambda$
as $i \to \infty$, i.e., we have:

$$\lim_{i \to \infty} P(X_i = x) = p_\lambda(x) \quad \text{for all } x \in X(y).$$



Following standard textbooks such as Fishman (1996), an application
of the Metropolis-Hastings algorithm to our problem is as follows.

Input: initial state $x_0 \in X(y)$,
    nominating matrix $p' = p(x; \pi)$ on $X(y)$,
    log-linear distribution $p = p_\lambda(x)$ on $X(y)$,
    evaluation matrix $\alpha_{x,z} = 1$ if $p(x) p'(z) \le p(z) p'(x)$,
        and $\alpha_{x,z} = \frac{p(z) p'(x)}{p(x) p'(z)}$ if $p(x) p'(z) > p(z) p'(x)$,
    terminal number of steps $k$.
Output: random sample $X_0, \ldots, X_k$ from $p_\lambda$ on $X(y)$.
Algorithm:
    $X_0 := x_0$,
    $i := 1$,
    While $i \le k$
        $x := X_{i-1}$,
        Randomly generate $z$ from $p'$,
        If $z = x$, then $X_i := X_{i-1}$,
        Else evaluate $\alpha_{x,z}$,
            Randomly generate $u$ from uniform distribution on $[0, 1]$,
            If $u \le \alpha_{x,z}$, then $X_i := z$,
            Else $X_i := X_{i-1}$,
        $i := i + 1$,
    return $X_0, \ldots, X_k$.
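
The procedure above translates almost line by line into Python. The following sketch is an illustration under simplifying assumptions: the state space is a small finite set given as dictionary keys, the nominating distribution is uniform rather than a stochastic context-free CLP model, and the names `metropolis_hastings`, `target` and `proposal` are hypothetical. The target may be given up to a normalizing constant, since only ratios of target values enter the evaluation matrix.

```python
import random

def metropolis_hastings(p, p_prime, x0, k, rng):
    """Return the chain X_0, ..., X_k for target p and nominating distribution p_prime.

    p and p_prime map states to positive numbers; p may be unnormalized.
    """
    states = list(p_prime)
    weights = [p_prime[s] for s in states]
    chain = [x0]
    for _ in range(k):
        x = chain[-1]
        z = rng.choices(states, weights=weights)[0]   # nominate z from p'
        if z == x:
            chain.append(x)
            continue
        # evaluation matrix: alpha_{x,z} = min(1, p(z) p'(x) / (p(x) p'(z)))
        ratio = (p[z] * p_prime[x]) / (p[x] * p_prime[z])
        alpha = 1.0 if ratio >= 1.0 else ratio
        u = rng.random()                              # uniform on [0, 1)
        chain.append(z if u <= alpha else x)
    return chain

# Hypothetical toy target over three "proof trees" and a uniform nominating distribution.
target = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
proposal = {"x1": 1 / 3, "x2": 1 / 3, "x3": 1 / 3}

chain = metropolis_hastings(target, proposal, "x1", 20000, random.Random(0))
freq = {s: chain.count(s) / len(chain) for s in target}
```

The relative frequencies in `freq` approximate the target probabilities for a sufficiently long chain. In practice an initial segment of the chain is often discarded, a refinement that is omitted here.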

In general, a proper random sample from a probability distribution
$p$ allows the estimation of expectations of functions $f$ with respect to $p$
directly from the sample points $X_i$, i.e., we have:

$$\lim_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} f(X_i) = \sum_x f(x)\, p(x).$$

For our application taking a random sample $\tilde X(y)$ from $p_\lambda$ on
$X(y)$ for a query $y \in \mathcal{Y}$ will allow us to calculate
expectations of functions with respect to the distribution $p_\lambda$ on
$X(y)$. A combination of i-ary random samples $\tilde X(y)$ from
$p_\lambda$ on $X(y)$ for queries $y \in \mathcal{Y}$ will yield a combined
random sample $\tilde{\mathcal{X}} = \bigcup_{y \in \mathcal{Y}} \tilde X(y)$
from $p_\lambda$ on $\mathcal{X} = \bigcup_{y \in \mathcal{Y}} X(y)$. From this
random sample we can then estimate expectations of functions with
respect to a distribution $p_\lambda$ on $\mathcal{X}$.

Returning to the estimation of the expectations involved in our
induction formulae, we note that we can use the same random sample from
$p^{(n)}$ for each iteration of Newton's method in estimating the gain
$G_c(\lambda^{(n)})$ for each candidate property $c \in C^{(n)}$
simultaneously. After adding a selected property $\hat c$ to the model, we
can again use a single random sample from the extended model for the
estimation of the maximum likelihood parameter values via Newton's method
for each property in parallel. This means that we can build up hash-tables
counting up how many times each property takes on which value. Let
$\mathcal{Y}$ be an incomplete data sample of size $N$, $\tilde X(y)$ be a
complete data sample of size $M$ for $y$, and $\tilde{\mathcal{X}}$ be a
combined complete data sample of size $L$. Then the relevant hash-tables
can be defined as follows:

1. $S_{c,v} = \sum_{x \in \tilde{\mathcal{X}}} \mathbb{1}[c(x) = v]$ = number
of times property function $c$ takes value $v$ in combined random sample
$\tilde{\mathcal{X}}$,

2. $T_{y,c,v} = \sum_{x \in \tilde X(y)} \mathbb{1}[c(x) = v]$ = number of
times property function $c$ takes value $v$ in random sample $\tilde X(y)$,

3. $U_{i,m} = \sum_{x \in \tilde{\mathcal{X}} \mid \nu_\#(x) = m} \nu_i(x)$ =
number of times property $\chi_i$ appears in combined random sample
$\tilde{\mathcal{X}}$ when there is a total number of $m$ property instances
for each sample point.

Furthermore, it will be convenient to define the following variables:

$$S_r(\alpha, c) = \sum_v S_{c,v}\, e^{\alpha v}\, v^r,$$

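The expectations involved in Newton's formulae for the property selection
task can then be approximated by random sample counts as follows:

$$\alpha_{t+1}
= \alpha_t - \frac{\frac{\partial}{\partial \alpha} G_c(\alpha_t + \lambda)}{\frac{\partial^2}{\partial \alpha^2} G_c(\alpha_t + \lambda)}
= \alpha_t + \frac{\sum_{y \in \mathcal{Y}} k_\lambda[c] - N\, p_\lambda[c\, e^{\alpha_t c}]}{N\, p_\lambda[c^2 e^{\alpha_t c}]}
\approx \alpha_t + \frac{\sum_{y \in \mathcal{Y}} \frac{1}{M} \sum_v T_{y,c,v}\, v - \frac{N}{L} S_1(\alpha_t, c)}{\frac{N}{L} S_2(\alpha_t, c)}.$$

Similar estimation formulae can be obtained for the task of parameter
estimation:

$$\alpha_{t+1}
= \alpha_t - \frac{\frac{\partial}{\partial \gamma_i} A(\gamma + \lambda)}{\frac{\partial^2}{\partial \gamma_i^2} A(\gamma + \lambda)}
= \alpha_t + \frac{\sum_{y \in \mathcal{Y}} k_\lambda[\nu_i] - N\, p_\lambda[\nu_i\, e^{\alpha_t \nu_\#}]}{N\, p_\lambda[\nu_i\, \nu_\#\, e^{\alpha_t \nu_\#}]}
\approx \alpha_t + \frac{\sum_{y \in \mathcal{Y}} \frac{1}{M} \sum_v T_{y,\nu_i,v}\, v - \frac{N}{L} U_0(\alpha_t, i)}{\frac{N}{L} U_1(\alpha_t, i)}$$

where $\alpha_t$ denotes the current estimate of $\gamma_i$, the derivatives
are evaluated at $\gamma_i = \alpha_t$, and, analogously to $S_r$,
$U_r(\alpha, i) = \sum_m U_{i,m}\, e^{\alpha m}\, m^r$.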

7. Search Methods 

The induction and approximation techniques of the preceding sections
provide the means to induce a proper probability distribution over
analyses of a log-linear probabilistic processing model from unanalyzed input
data. In case of ambiguity, this allows us to distinguish between
analyses according to a well-defined and practical quality measure. However,
if we are interested only in the best analysis of a given input, so far a
ranking of analyses requires a listing of analyses in order to choose the
best one. Clearly, it would be nice to have search techniques like Viterbi's
algorithm (Viterbi 1967), which works well for probabilistic processing
models based on context-free stochastic derivation processes.

Viterbi's algorithm is built upon a table of derivation states, called a 
chart, describing different pending derivations. During derivation, each 
state must keep track of the most probable path of states leading towards 
it. When the final state is reached, the maximum probability derivation 
can be recovered by tracing back the path of the best predecessor states. 
Different specifications of the algorithm depend on the chosen parsing 
strategy and the underlying probabilistic model. 

In the following we will sketch one possibility to transfer these ideas to
a method of probabilistic parsing in the area of CLP. For this aim we rely
on the well-known parsing algorithm of Earley deduction. This technique
provides the necessary chart structure together with a simple parsing
strategy. Depending on the specific definition of the property vector in the
underlying log-linear CLP model, different definitions of the propagation
of probabilities during the parsing process are possible. Since the property
vector is considered to be an open parameter in our setting, we will not
present a definitive solution to this problem but only give some rules of
thumb on how to proceed for some general examples.
Earley deduction was introduced by Pereira and Warren (1983) as a
generalization of Earley's context-free parsing algorithm (see Aho and
Ullman (1972)) to a parsing algorithm for definite clause grammars.
Extensions of this method in the general setting of the CLP scheme of
Höhfeld and Smolka (1988) have been presented, e.g., by Dörre (1993) and
Dörre and Johnson (1995). The basic concepts of Earley deduction for CLP
can be described as follows: Earley deduction works on two sets of
definite clauses, the set of program clauses $\mathcal{P}$ and the set of
derived clauses constituting the chart $\mathcal{C}$. An active item
corresponds to a definite clause with at least one relational atom on its
righthand side, i.e., to a non-unit clause. Passive items correspond to
clauses whose righthand sides consist only of an $\mathcal{L}$-constraint,
i.e., to unit clauses. The input to the algorithm consists of a set of
program clauses $\mathcal{P}$ and a query $G$. The content of the chart
$\mathcal{C}$ initially consists of $G$ and is continually added to by the
following inference rules:

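Prediction:

$$\frac{c_1 = (H_1 \leftarrow B_1) \in \mathcal{C} \qquad c_2 = (H_2 \leftarrow B_2) \in \mathcal{P}}{c_3 = (G \leftarrow B_2' \cup \phi) \in \mathcal{C}}$$

where $c_1$ is non-unit, $c_2$ is unit or non-unit, $G$ is the selected
literal in $B_1$, $\phi$ is the $\mathcal{L}$-constraint in $B_1$, and there
exists a variant $c_2' = (G \leftarrow B_2')$ of $c_2$ s.t.
$V(c_1) \cap V(B_2') \subseteq V(G)$.

Completion:

$$\frac{c_1 = (H_1 \leftarrow B_1) \in \mathcal{C} \qquad c_2 = (H_2 \leftarrow B_2) \in \mathcal{C}}{c_3 = (H_1 \leftarrow (B_1 \setminus G) \cup B_2') \in \mathcal{C}}$$

where $c_1$ is non-unit, $c_2$ is unit, $G$ is the selected literal in $B_1$,
and there exists a variant $c_2' = (G \leftarrow B_2')$ of $c_2$ s.t.
$V(c_1) \cap V(B_2') \subseteq V(G)$.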

A probabilistic version of a context-free Earley parser was presented
in Stolcke (1993). In this framework, during derivation each completed
state keeps track of the most probable path of states contributing to
it. The probability propagation is done recursively by associating each
predicted state with the probability of the corresponding rule and taking
at each completion step the maximum of all products of probabilities of
two states from which the completed state is derivable. When the final
state is reached, the most probable analysis can easily be retrieved by
building up a tree in accordance with the most probable path of states
leading to the final completion.

If the property vector of a log-linear CLP model is defined s.t.
properties are identified with program clauses, then the above model can be
used also for probabilistic Earley deduction: During deduction, each predicted
clause is associated with a weight corresponding to the clause-property 
used in the prediction. For each completed clause, the pair of clauses con- 
tributing with maximal product of weights to the completion is recorded. 
Given a procedure to construct a proof tree from a sequence of clauses 
linked by prediction and completion, the highest weighted partial proof 
tree corresponding to a completed clause can be constructed recursively 
and uniquely from the highest weighted pair of clauses contributing to 
the completion. 

Unfortunately, weight propagation will get more complicated as we al- 
low more complicated properties in our underlying log-linear CLP model. 
In case properties are identified with program clauses, completion means 
complete reduction of selected atoms using appropriate clauses. A numer- 
ical comparison between different ways of arriving at the same completed 
state can be done at every completion step. In contrast to this, if proper- 
ties are allowed to be subtrees of proof trees, completion means completely 
building up a subtree of a proof tree during derivation. A numerical
comparison between two ways of "completing" the same subtree in the same
completion state might have to wait for several completion steps until the
subtree is completely built up. Considering the possibility of a backward
construction of the most probable proof tree in this setting, we cannot rely
on an easy recording of the most probable path of clauses leading to the 
final completion state. Instead, in order to compare between the weights 
of the partial derivations contributing to such a "subtree-completion" , we 
have to incrementally build up partial proof trees and check their prop- 
erties during derivation. 

Let subtree-properties be defined as follows: A subtree of a proof tree is 
a connected subgraph of a proof tree, each node of a subgraph has either 
zero descendants or the same number of descendants as the corresponding 
node of the supergraph, and the node sets of every two subtrees in the set 
of properties do not intersect. Then a simple recursive procedure to build 
up partial proof trees from completed states can be defined as follows: 
For each completed state $c_k$, for each pair of states $c_i, c_j$ from
which $c_k$ is derivable by completion, the partial proof tree $t_{ij}$
corresponding to the completion of state $c_k$ from states $c_i, c_j$
is constructed as follows:

1. if $c_i, c_j$ are completed states with trees $t_i, t_j$, then $t_{ij}$
is obtained by combining $t_i$ and $t_j$
[tree-composition diagram for this case not recoverable],

2. if $c_i$ is a predicted state $(E \leftarrow F)$ with tree $t_i$
consisting of root $E$ and daughter $F$, and $c_j$ is a completed state with
tree $t_j$, then $t_{ij}$ is obtained by combining $t_i$ and $t_j$
[tree-composition diagram for this case not recoverable],

3. if $c_i$ is a predicted state $(E \leftarrow F)$ with tree $t_i$
consisting of root $E$ and daughter $F$, and $c_j$ is a predicted state
including $\mathcal{L}$-constraint $\phi$ and with tree $t_j = \phi$, then
$t_{ij}$ is obtained by combining $t_i$ and $t_j$
[tree-composition diagram for this case not recoverable].

During derivation, for each "property-completion" at some completed
state $c_k$, the variable $t_k$ denoting the partial proof tree corresponding
to $c_k$ is instantiated to the most probable partial proof tree $t_{ij}$
which can be built from all states $c_i, c_j$ contributing to the completion
of $c_k$:

Let $p_\lambda$ be a log-linear distribution on the set $\mathcal{X}$ of
proof trees of a constraint logic program $\mathcal{P}$ with property
vector $\chi$ and property function vector $\nu$. Then for each completed
state $c_k$, for each property $\chi_n \in \chi$, for each partial proof tree
$t_{ij}$ constructable for $c_k$ from trees $t_i, t_j$ s.t.
$\nu_n(t_{ij}) > \nu_n(t_i) + \nu_n(t_j)$,
set $t_k = \arg\max_{t_{ij}} p_\lambda(t_{ij})$.

For the above definition of subtree-properties, this procedure guaran- 
tees that the most probable proof tree is built up during derivation. The 
possible savings in computational complexity induced by this procedure
clearly depend on the size of the subtree-properties to be worked out 
during derivation. However, if subtree-properties are allowed to be over- 
lapping or disconnected subgraphs of proof trees, then the above dynamic 
programming approach is no longer applicable. In this case either exhaus- 
tive search or approximation methods are required. 



8. Conclusion 

We presented a log-linear probability model for probabilistic CLP. On 
top of this model we defined an algorithm to estimate the parameters 
and to select the properties of log-linear models from incomplete data. 
This algorithm is an extension of the iterative scaling algorithm of Della
Pietra, Della Pietra, and Lafferty (1995) adjusted to incomplete data.
The algorithm applies to log-linear models in general and is accompanied
by suitable approximation methods when applied to large data spaces.
Furthermore, we presented an approach to search for most probable anal- 
yses of the probabilistic CLP model. This can be useful for the ambiguity 
resolution problem in natural language processing applications. 

Compared with Abney's approach to a log-linear model for stochas- 
tic attribute-value grammars, our approach adds the important aspect 
of incomplete data to the parameter estimation and property selection 
problem. Furthermore, we investigate the problem of searching for best
analyses, which is not addressed by Abney (1996).

The expressive power of log-linear models even allows us to couch other
approaches to probabilistic processing beyond context-freeness in terms
of this framework. Statistical decision trees as used in the probabilistic
parsing model of Magerman (1994) can be cast in the log-linear framework
by encoding the questions building up a decision tree as binary-valued,
disjoint property functions. Property selection can then be seen as closely
related to growing a decision tree, and iterative maximization can be seen
as maximum likelihood estimation for decision trees defined in this way.
However, in contrast to the algorithms used by Magerman (1994), which
require large samples of complete data, our approach allows induction of
the probabilistic model from incomplete data.
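
The correspondence between decision-tree leaves and disjoint binary
properties can be made concrete with a small sketch. It covers only the
complete-data special case and is not the estimation algorithm of this
paper. The questions, the outcome space OMEGA and the sample are made up
for illustration. Since exactly one leaf property is active for every
outcome, a single iterative scaling update from zero parameters already
matches the empirical leaf frequencies.

```python
import math
from itertools import product


# Hypothetical yes/no questions asked along the paths of a small tree.
def q_first(x):
    return x[0] == 1


def q_second(x):
    return x[1] == 1


# Each leaf is a conjunction of answers, so the leaves are disjoint
# and together cover the whole outcome space.
LEAVES = [
    lambda x: not q_first(x),
    lambda x: q_first(x) and not q_second(x),
    lambda x: q_first(x) and q_second(x),
]

OMEGA = list(product([0, 1], repeat=3))  # illustrative outcome space


def props(x):
    # Binary-valued property vector: one indicator per leaf.
    return [1 if leaf(x) else 0 for leaf in LEAVES]


def model(lambdas):
    # Log-linear distribution p(x) = exp(lambda . props(x)) / Z over OMEGA.
    w = {x: math.exp(sum(l * f for l, f in zip(lambdas, props(x)))) for x in OMEGA}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}


def expectation(p):
    # Expected value of each leaf property under distribution p.
    return [sum(p[x] * props(x)[i] for x in OMEGA) for i in range(len(LEAVES))]


def scaling_step(lambdas, empirical):
    # Scaling update for properties that sum to 1 on every outcome.
    current = expectation(model(lambdas))
    return [l + math.log(e / c) for l, e, c in zip(lambdas, empirical, current)]


sample = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1),
          (1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
empirical = [sum(props(x)[i] for x in sample) / len(sample)
             for i in range(len(LEAVES))]
lambdas = scaling_step([0.0] * len(LEAVES), empirical)
fitted = expectation(model(lambdas))
```

After the update, the model's leaf probabilities in fitted agree with the
relative leaf frequencies in empirical, which is the maximum likelihood
estimate for a decision tree with these leaves.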

A similar statement can be made for the probabilistic tree substitution
model of Bod (1995). This approach can be couched as a log-linear model
employing all subtrees of a tree bank, which is annotated according to
some grammar framework, as properties of the model. Again, Bod's
approach relies on hand-analyzed data and does not allow the probabilistic
model to be estimated from unanalyzed input. Furthermore, this approach
does not provide a means to automatically select subtree-properties from
the exponentially many candidates.
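
How quickly the number of candidate subtree-properties grows can be seen
in a short counting sketch. It assumes the usual notion of a subtree in
tree substitution models: a connected fragment with more than one node in
which every expanded node keeps all of its daughters. The tuple encoding of
trees and the function names are hypothetical.

```python
def fragments_rooted_at(node):
    # Terminal words (plain strings) cannot root a fragment.
    if isinstance(node, str):
        return 0
    _label, *children = node
    total = 1
    for child in children:
        # Each daughter is either left unexpanded (1 way) or expanded
        # by any fragment rooted at it.
        total *= 1 + fragments_rooted_at(child)
    return total


def all_fragments(node):
    # Sum the fragment counts over every nonterminal node of the tree.
    if isinstance(node, str):
        return 0
    _label, *children = node
    return fragments_rooted_at(node) + sum(all_fragments(c) for c in children)


# A tiny illustrative tree: (label, daughter, ...) with words as strings.
tree = ("S", ("NP", "she"), ("VP", ("V", "sleeps")))
n_fragments = all_fragments(tree)
```

Even this four-node analysis yields ten candidate subtrees, and the count
multiplies at every branching node, so a tree bank of realistic size
contributes far more candidates than can be used without some form of
selection.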

Clearly, our model of probabilistic CLP is not the last word on
probabilistic processing beyond context-freeness. As mentioned above, log-linear
models are closely related to other probabilistic models such as random
fields (Geman 1990), graphical networks (Pearl 1988) or neural networks
(Ackley and Hinton 1985). Future work should exploit this resemblance in
order to learn from related techniques to induce, approximate or search in
log-linear probability models. Furthermore, the possibilities of our
powerful processing model should be applied to natural language processing tasks
other than ambiguity resolution.